grep-ing a variable vs. a file - execution time

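I made an interesting observation. I was storing the output of a curl statement in a text file and then grep-ing it for certain strings. Later I changed my code to store the output in a variable instead. It turns out this change made my script run slower. This was really counterintuitive to me, since I had always thought I/O operations were more expensive than in-memory operations. Here is the code: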

#!/bin/bash 
URL="http://m.cnbc.com" 
while read line; do 
    UA=$line 
    curl -s --location --user-agent "$UA" $URL > RAW.txt 
    #RAW=`curl --location --user-agent "$UA" $URL` 
    L=`grep -c -e "Advertise With Us" RAW.txt` 
    #L=`echo $RAW | grep -c -e "Advertise With Us"` 
    M=`grep -c -e "id='menu'><button>Menu</button>" RAW.txt` 
    #M=`echo $RAW | grep -c -e "id='menu'><button>Menu</button>"` 
    D=`grep -c -e "Careers" RAW.txt` 
    #D=`echo $RAW | grep -c -e "Careers"` 
    if [[ ($L == 1 && $M == 0) && ($D == 0) ]]
    then
        AC="Legacy"
    elif [[ ($L == 0 && $M == 1) && ($D == 0) ]]
    then
        AC="Modern"
    elif [[ ($L == 0 && $M == 0) && ($D == 1) ]]
    then
        AC="Desktop"
    else
        AC="Unable to Determine"
    fi
    echo $AC >> Results.txt 
done < UserAgents.txt

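The commented-out lines show the variable-storage approach. Any ideas why this happens? Also, are there any ways to speed this script up further? Right now it takes about 8 minutes to process 2000 input entries.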

2013-04-25 Ravi Gupta

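In the original version, `RAW.txt` probably fits in cache, so you don't pay an I/O penalty for the successive calls to `grep`. In your "optimized" version, you increase the number of processes you need to fork, because of the pipeline feeding each call to `grep`. Keep in mind, though, that forking several processes for each of 2000 lines is the wrong thing to do if you want speed. – chepner 2013-04-25 12:27:45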
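One way to check this on your own machine is a rough timing loop like the sketch below. It is not from the original thread: the temporary file, the sample content, and the iteration count `N` are made up for illustration, and which variant comes out faster will depend on the system.

```shell
#!/bin/bash
# Rough timing sketch: grep on a (cached) file vs. piping a variable into grep.
# The sample data and N are illustrative assumptions, not from the original post.
tmp=$(mktemp)
printf 'line %s\n' {1..2000} > "$tmp"
echo "Advertise With Us" >> "$tmp"
RAW=$(<"$tmp")          # same content held in a shell variable

N=100

time_file() {
    # One fork per iteration: grep reads the file directly.
    for ((i = 0; i < N; i++)); do
        L_FILE=$(grep -c -e "Advertise With Us" "$tmp")
    done
}

time_pipe() {
    # The pipeline adds a second process (for echo) per iteration.
    for ((i = 0; i < N; i++)); do
        L_PIPE=$(echo "$RAW" | grep -c -e "Advertise With Us")
    done
}

time time_file
time time_pipe
rm -f "$tmp"
```

Both loops should produce the same match count; only the elapsed times reported by `time` differ.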

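Chepner's correct. Call cURL just once per line and read its output once, flagging each of the three desired strings. Here's some example code using awk. Completely untested: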

URL="http://m.cnbc.com"
while IFS= read -r line; do
    # Fetch the page once; -s suppresses curl's progress meter.
    RAW=$(curl -s --location --user-agent "$line" "$URL")

    awk '
    /Advertise With Us/ {
        L=1
    }
    /id='\''menu'\''><button>Menu<\/button>/ {
        M=1
    }
    /Careers/ {
        D=1
    }

    END {
        if (L==1 && M==0 && D==0) {
            s = "Legacy"
        }
        else if (L==0 && M==1 && D==0) {
            s = "Modern"
        }
        else if (L==0 && M==0 && D==1) {
            s = "Desktop"
        }
        else {
            s = "Unable to Determine"
        }

        print s >> "Results.txt"
    }' <<< "$RAW"    # pass the text on stdin; a bare "$RAW" argument would be taken as a file name

done < UserAgents.txt

2013-04-25 13:46:13 Steve

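Do you really need to count the number of matches with grep -c? It looks like you only need to know whether a match was found. If so, you can simply use bash's built-in string comparison.

Also, it will be faster if you write to the results file outside the loop.

Try the following: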

#!/bin/bash
URL="http://m.cnbc.com"
while IFS= read -r line
do
    UA="$line"
    RAW=$(curl -s --location --user-agent "$UA" "$URL")
    # Substring tests run inside bash, so no grep process is forked.
    [[ $RAW == *"Advertise With Us"* ]] && L=1 || L=0
    [[ $RAW == *"id='menu'><button>Menu</button>"* ]] && M=1 || M=0
    [[ $RAW == *Careers* ]] && D=1 || D=0

    # Same conditions as in the question's script.
    if ((L==1 && M==0 && D==0))
    then
        AC="Legacy"
    elif ((L==0 && M==1 && D==0))
    then
        AC="Modern"
    elif ((L==0 && M==0 && D==1))
    then
        AC="Desktop"
    else
        AC="Unable to Determine"
    fi
    echo "$AC"
done < UserAgents.txt > Results.txt
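The classification step can also be pulled into a function so it can be checked without touching the network. This is only a sketch: `classify` is a hypothetical helper name that does not appear in the thread, the sample inputs are invented, and the conditions follow the ones in the question's script.

```shell
#!/bin/bash
# classify is a hypothetical helper, not part of the original scripts.
# It takes page text as $1 and prints a category, using only bash
# substring tests (no grep or awk process is started).
classify() {
    local raw=$1 L=0 M=0 D=0
    [[ $raw == *"Advertise With Us"* ]] && L=1
    [[ $raw == *"id='menu'><button>Menu</button>"* ]] && M=1
    [[ $raw == *Careers* ]] && D=1

    # Exactly one marker must be present, as in the question's conditions.
    case "$L$M$D" in
        100) echo "Legacy" ;;
        010) echo "Modern" ;;
        001) echo "Desktop" ;;
        *)   echo "Unable to Determine" ;;
    esac
}

# Made-up sample inputs, one per branch.
classify "<a>Advertise With Us</a>"
classify "<div id='menu'><button>Menu</button></div>"
classify "<a>Careers</a>"
classify "<p>nothing relevant</p>"
```

Inside the loop, the call would then look like `classify "$(curl -s --location --user-agent "$line" "$URL")"`, with the loop's output still redirected to the results file once.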

2013-04-25 15:07:02 dogbane

@dogbane Do you really need to count the number of matches with grep -c? It looks like you only need to know whether a match was found. – 2013-04-25 19:22:16
