What is the fastest way to count the number of words in a string in Perl?

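I have several functions that I run over a million times on various texts, which means that small improvements in these functions translate into big gains overall. Currently, I've noticed that all of my functions that involve word counts take drastically longer to run than everything else, so I want to try counting words in a different way.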

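Basically, my function grabs several objects that have text associated with them, verifies that the text doesn't match certain patterns, and then counts the number of words in that text. A basic version of the function is: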

my $num_words = 0; 
for (my $i=$begin_pos; $i<=$end_pos; $i++) { 
    my $text = $self->_getTextFromNode($i); 
    #If it looks like a node full of bogus text, or just a number, remove it. 
    if ($text =~ /^\s*\<.*\>\s*$/ && $begin_pos == $end_pos) { return 0; } 
    if ($text =~ /^\s*(?:Page\s*\d+)|http/i && $begin_pos == $end_pos) { return 0; } 
    if ($text =~ /^\s*\d+\s*$/ && $begin_pos == $end_pos) { return 0; } 
    my @text_words = split(/\s+/, $text); 
    $num_words += scalar(@text_words); 
    if ($num_words > 30) { return 30; } 
} 
return $num_words; 
}

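I do a lot of text comparisons like these elsewhere in my code, so I'm guessing the problem must be my word counting. Is there a faster way to do it than splitting on \s+? If so, what is it, and why is it faster (so I can understand what I'm doing wrong and apply that knowledge to similar problems later)?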

2011-05-19 Eli

Using a while loop with a regex is the fastest way I've found to count words:

use feature 'say';   # needed for say() below

my $text = 'asdf asdf asdf asdf asdf'; 

sub count_array { 
    my @text_words = split(/\s+/, $text); 
    scalar(@text_words); 
} 

sub count_list { 
    my $x =()= $text =~ /\S+/g;  #/ 
} 

sub count_while { 
    my $num; 
    $num++ while $text =~ /\S+/g;  #/ 
    $num 
} 

say count_array; # 5 
say count_list; # 5 
say count_while; # 5 

use Benchmark 'cmpthese'; 

cmpthese -2 => { 
    array => \&count_array, 
    list => \&count_list, 
    while => \&count_while, 
} 

#           Rate  list array while
# list  303674/s    --  -22%  -55%
# array 389212/s   28%    --  -42%
# while 675295/s  122%   74%    --

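The while loop is faster because no memory has to be allocated for each word found. In addition, the regex is evaluated in boolean (scalar) context, which means it doesn't need to extract the actual matches from the string.

Because the count is capped at 30, the loop can also return as soon as it reaches that number: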

while ($text =~ /\S+/g) { 
    ++$num_words == 30 && return $num_words; 
}  
return $num_words;
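
The early-exit loop can also be wrapped in a standalone sub. This is only a minimal sketch: the name count_words_capped and the $cap parameter are hypothetical and not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: count whitespace-separated words in $text,
# stopping as soon as $cap words have been seen.
sub count_words_capped {
    my ($text, $cap) = @_;
    my $num_words = 0;
    # In scalar context, a /g match resumes from the previous match
    # position on each iteration, so no list of words is ever built.
    while ($text =~ /\S+/g) {
        return $cap if ++$num_words >= $cap;
    }
    return $num_words;
}

print count_words_capped("one two three", 30), "\n";   # 3
print count_words_capped("a b c d e f", 4), "\n";      # 4
```

Because the sub works on its own copy of the string, returning in the middle of the loop does not leave a stale match position on the caller's variable.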

Or, using split:

2011-05-19 19:18:49

Nice! Thanks! This is great. – Eli 2011-05-19 19:22:06

Great "why" to go with the "what". Nice that you also pointed out 'Benchmark' for further experimentation and optimization. – DCharness 2011-05-19 19:28:39

Since you only need the number of words and not the array of words, it would be good to avoid split. Something like this might work:

$num_words += $text =~ s/((^|\s)\S)/$1/g;

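It trades the work of building an array of words for the work of replacing the start of each word with itself; s///g returns the number of substitutions it made. You would need to benchmark it to see whether it is actually faster.

Edit: this might be faster: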

++$num_words while $text =~ /\S+/g;
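
As a quick sanity check, the two counting expressions can be compared on a string with leading, trailing, and doubled whitespace. This is an illustrative sketch only: the sample string and the variable names are made up for the example.

```perl
use strict;
use warnings;

my $text = "  the quick  brown fox ";

# s///g returns the number of substitutions made; replacing each
# word start with itself leaves the string unchanged.
my $copy = $text;
my $by_subst = ($copy =~ s/((^|\s)\S)/$1/g) || 0;

# Scalar-context /g match: one iteration per run of non-whitespace.
my $by_while = 0;
++$by_while while $text =~ /\S+/g;

print "$by_subst $by_while\n";   # 4 4
```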

2011-05-19 19:18:39

Since you cap the word count at 30, you can stop early by passing a limit to split:

$num_words =() = split /\s+/, $text, 30;

2011-05-19 19:20:11

Certainly. I merged it into @Eric Strom's answer. – Eli 2011-05-19 19:27:21

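For correctness, following aleroot's answer, you probably want split " " rather than the original split /\s+/, to avoid a fencepost error. As the split documentation puts it: A "split" on "/\s+/" is like a "split(' ')" except that any leading whitespace produces a null first field. This difference gives you one extra word (the empty first field, that is) per line.

For speed, since you are capping the number of words at 30, you probably want to use the LIMIT argument: split " ", $str, 30.

On the other hand, the other answers wisely steer you away from split entirely, since you don't need the list of words, only their count.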
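
The fencepost difference and the effect of LIMIT can be seen in a short sketch; the sample strings and variable names here are made up for illustration.

```perl
use strict;
use warnings;

my $line = "  alpha beta gamma";

my @by_regex = split /\s+/, $line;   # leading whitespace yields an empty first field
my @by_space = split ' ',   $line;   # special case: leading whitespace is skipped

print scalar(@by_regex), "\n";   # 4
print scalar(@by_space), "\n";   # 3

# With LIMIT, split produces at most that many fields; the last
# field simply holds the unsplit remainder of the string.
my $long   = join ' ', 1 .. 100;
my $capped = () = split ' ', $long, 30;
print "$capped\n";               # 30
```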

2011-05-19 19:21:48 DCharness

...per line with leading whitespace – ysth 2011-05-20 00:36:26

Yes, good correction: on any line with leading whitespace, not on every line. – DCharness 2011-05-20 20:41:25

If the words are separated only by single spaces, counting the spaces is fast.

sub count1 
{ 
    my $str = shift; 
    return 1 + ($str =~ tr{ }{ }); 
}
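
The single-space assumption matters. The sketch below repeats count1 and runs it on a few made-up inputs to show where the result drifts from the real word count.

```perl
use strict;
use warnings;

sub count1 {
    my $str = shift;
    # tr returns the number of characters it matched (here: spaces).
    return 1 + ($str =~ tr{ }{ });
}

print count1("asdf asdf asdf"), "\n";   # 3
print count1("asdf  asdf"), "\n";       # 3, although there are only 2 words
print count1(""), "\n";                 # 1, although there are no words
print count1("asdf\tasdf"), "\n";       # 1, because tabs are not counted
```

If the input may contain doubled spaces, tabs, or empty strings, one of the regex-based counts is the safer choice.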

Updated benchmark:

use feature 'say';   # needed for say() below

my $text = 'asdf asdf asdf asdf asdf'; 

sub count_array { 
    my @text_words = split(/\s+/, $text); 
    scalar(@text_words); 
} 

sub count_list { 
    my $x =()= $text =~ /\S+/g;  #/ 
} 

sub count_while { 
    my $num; 
    $num++ while $text =~ /\S+/g;  #/ 
    $num 
} 

sub count_tr { 
    1 + ($text =~ tr{ }{ }); 
} 

say count_array; # 5 
say count_list; # 5 
say count_while; # 5 
say count_tr; # 5 

use Benchmark 'cmpthese'; 

cmpthese -2 => { 
    array => \&count_array, 
    list => \&count_list, 
    while => \&count_while, 
    tr => \&count_tr, 
} 

#            Rate  list while array    tr
# list   220911/s    --  -24%  -44%  -94%
# while  291225/s   32%    --  -26%  -92%
# array  391769/s   77%   35%    --  -89%
# tr    3720197/s 1584% 1177%  850%    --

2011-05-19 19:48:57 hexcoder

+1 for excellent out-of-the-box thinking. – TLP 2011-05-19 21:14:29

Unless /d is used, the replacement list defaults to the search list, so 'tr/ //' is equivalent to 'tr{ }{ }' – ysth 2011-05-20 00:35:23
