Bash script optimization

This is the script in question:

for file in `ls products`
do
    echo -n `cat products/$file \
    | grep '<td>.*</td>' | grep -v 'img' | grep -v 'href' | grep -v 'input' \
    | head -1 | sed -e 's/^ *<td>//g' -e 's/<.*//g'`
done

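I have 50,000+ files, and this script takes about 12 hours to run on them.

The algorithm is as follows:

Find only the lines that contain a table cell (<td>) and do not contain any of "img", "href", or "input".
Select the first of them, then extract the data between the tags.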
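
For reference, the two steps above can be sketched as a single sed invocation per file, instead of the seven processes per file that the original pipeline starts. This is only a sketch: the function name first_cell is made up for illustration, and it assumes, like the original script, that each cell sits on a single line.

```shell
# first_cell FILE (hypothetical helper name): print the text of the first
# <td>...</td> line in FILE that mentions none of img, href, or input.
first_cell() {
    sed -n '
        /<td>.*<\/td>/ {
            /img/d
            /href/d
            /input/d
            s/^ *<td>//
            s/<.*//
            p
            q
        }
    ' "$1"
}

# Usage: one sed process per file.
for file in products/*; do
    [ -f "$file" ] && first_cell "$file"
done
```

The q command makes sed stop reading as soon as the first qualifying line has been printed, which is what head -1 did in the original pipeline.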

The usual bash text filters (sed, grep, awk, etc.) are available, as well as perl.

Source

2011-05-05 Marko

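If you're not going to run this more than once or twice, who cares if it takes half a day to run? If you spend 2 hours optimizing it and only get a 1-hour speedup... is it worth it? – cdeszaq 2011-05-05 19:29:04

@cdeszaq: I have four other similar scripts, and I'm sure that once I see this one optimized, I can optimize those too. – Marko 2011-05-05 19:34:47

Looks like it can all be replaced by a single gawk command: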

gawk '
    /<td>.*<\/td>/ && !(/img/ || /href/ || /input/) {
        sub(/^ *<td>/,""); sub(/<.*/,"")
        print
        nextfile
    }
' products/*

This uses the gawk extension nextfile.
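
If the awk at hand lacks nextfile, a per-file flag gives the same output. The following is a sketch, not a drop-in equivalent: the function name first_cells and the variable found are made up for illustration, and unlike nextfile this version still reads the rest of each file after the first hit, so it saves processes but not I/O.

```shell
# first_cells FILE... (hypothetical helper name): same filter as above,
# written without nextfile.
first_cells() {
    awk '
        FNR == 1 { found = 0 }      # first line of a new input file: reset the flag
        !found && /<td>.*<\/td>/ && !/img|href|input/ {
            sub(/^ *<td>/, "")
            sub(/<.*/, "")
            print
            found = 1               # ignore the remaining lines of this file
        }
    ' "$@"
}
```

It would be called as first_cells products/* in place of the gawk command.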

If the wildcard expansion is too large, then

find products -type f -print | xargs gawk '...'
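
One caveat with find | xargs is that file names containing spaces or quotes get split. A possible variant, assuming a find that supports the standard -exec ... {} + form, lets find batch the file names itself. The wrapper name scan_tree is made up for illustration; the gawk program is the one from this answer.

```shell
# scan_tree DIR (hypothetical helper name): run the gawk filter over every
# regular file under DIR, with find passing the names in large batches.
scan_tree() {
    find "${1:-products}" -type f -exec gawk '
        /<td>.*<\/td>/ && !(/img/ || /href/ || /input/) {
            sub(/^ *<td>/, ""); sub(/<.*/, "")
            print
            nextfile
        }
    ' {} +
}
```

Because nextfile only skips to the next file within one gawk run, the result does not depend on how many batches find decides to use. The order of the output follows the order in which find visits the files.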

Source

2011-05-05 19:58:16

+1 Very nice – hmontoliu 2011-05-05 20:39:41

Here is some quick perl to do the whole thing; it should be faster.

#!/usr/bin/perl
use strict;
use warnings;

process_files($ARGV[0]);

# process each file in the supplied directory
sub process_files
{
    my $dirpath = shift;
    opendir(my $dh, $dirpath) or die "Can't readdir $dirpath. $!";
    # readdir in list context returns every remaining entry at once
    foreach my $ent (readdir($dh)) {
        if (-f "$dirpath/$ent") {
            get_first_text_cell("$dirpath/$ent");
        }
    }
    closedir($dh);
}

# print the content of the first html table cell
# that does not contain img, href or input tags
sub get_first_text_cell
{
    my $filename = shift;
    open(my $fh, '<', $filename) or die "Can't open $filename. $!";
    while (my $line = <$fh>) {
        ## capture html and text inside a table cell
        if ($line =~ /<td>([&;\d\w\s"'<>]+)<\/td>/i) {
            my $cell = $1;

            ## omit anything with the following tags
            if ($cell !~ /<(img|href|input)/) {
                print "$cell\n";
                last;   # stop reading after the first matching cell
            }
        }
    }
    close($fh);
}

Just call it with the directory to search as the first argument:

$ perl parse.pl /html/documents/

Source

2011-05-05 19:41:33 IanNorton

Running this on my system against a test set of 1000 files, it takes less than a second. – IanNorton 2011-05-05 19:53:57

What about this (it should be faster and clearer):

for file in products/*; do
    grep -P -o '(?<=<td>).*(?=<\/td>)' "$file" | grep -vP -m 1 '(img|input|href)'
done

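The for loop visits every file in products. Note the difference from your syntax.
The first grep outputs just the text between <td> and </td>, without those tags, for every cell, as long as each cell is on a single line.
Finally, the second grep outputs only the first of those lines that does not contain img, href or input (which is what I believe you wanted to achieve with head -1). It exits right there, which reduces the overall time and lets the next file be processed sooner.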
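
To make the two stages concrete, here is a small sketch that runs them on a throwaway sample file. It assumes a grep built with -P (PCRE) support, as the loop above does; the function names cells and first_cell_grep and the sample contents are made up for illustration.

```shell
# Stage 1 (hypothetical helper name): emit the text between <td> and </td>
# for every line that has a cell.
cells() {
    grep -P -o '(?<=<td>).*(?=</td>)' "$1"
}

# Stage 1 + stage 2: keep the first cell that mentions none of the three
# words; -m 1 makes grep exit after the first selected line.
first_cell_grep() {
    cells "$1" | grep -v -m 1 -E 'img|input|href'
}

# Throwaway sample file.
sample=$(mktemp)
printf '%s\n' '<td><img src="a.png"></td>' '  <td>Widget 42</td>' '<td>Other</td>' > "$sample"

cells "$sample"            # three lines: the img tag, Widget 42, Other
first_cell_grep "$sample"  # one line: Widget 42
rm -f "$sample"
```

Note that the filter here is applied to the cell text only, whereas the original pipeline dropped any line that mentioned img, href or input anywhere on it.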

I would have loved to use just a single grep, but then the regex would be really awful. :-)

Disclaimer: of course, I haven't tested it

Source

2011-05-05 20:06:20 hmontoliu
