2016-02-26 103 views
-4

I have the following Perl code, in which a global regex match hangs:

# $content is the text of a webpage 
while ($content =~ /rgRow.*?<td>(.*?)<\/td><td.*?>(.*?)<\/td><td.*?>(.*?)<\/td><td.*?>.*?<\/td><td.*?>(.*?)<\/td><td.*?><nobr>(.*?)<\/nobr><\/td>/sg) { 
    # do stuff 
} 

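I have determined that the code hangs on this regex match. It goes through 2-3 iterations of the while loop and then hangs. I left it running for about 30 minutes and it did not continue.

What could be the problem?

The purpose of the code is to go through some HTML and extract some data from it.

Here is the HTML that I am setting $content to: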

<tbody> 
     <tr class="rgRow InnerItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__0"> 
      <td>CONSIDERATION OF REPORTS SUBMITTED BY STATES PARTIES UNDER ARTICLE 9 OF THE CONVENTION : SECOND PERIODIC REPORT OF STATES PARTIES DUE IN 1974/MOROCCO</td><td>State party's report</td><td>CERD</td><td>Morocco</td><td>CERD/C/R.65/Add.1</td><td><nobr>21 Feb 1974</nobr></td><td> 
              <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl04_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=CERD%2fC%2fR.65%2fAdd.1&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
             </td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">CERD/C/R.65/Add.1</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td> 
     </tr><tr class="rgRow InnerAlernatingItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__1"> 
      <td>CONSIDERATION OF REPORTS SUBMITTED BY STATES PARTIES UNDER ARTICLE 9 OF THE CONVENTION : INITIAL REPORTS OF STATES PARTIES WHICH ARE DUE IN 1972/MOROCCO</td><td>State party's report</td><td>CERD</td><td>Morocco</td><td>CERD/C/R.33/Add.1</td><td><nobr>17 Jan 1972</nobr></td><td> 
              <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl06_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=CERD%2fC%2fR.33%2fAdd.1&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
             </td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">CERD/C/R.33/Add.1</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td> 
     </tr><tr class="rgRow InnerItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__2"> 
      <td>Annex I to ALGERIA's Report</td><td>Annex to State party report</td><td>CERD</td><td>Algeria</td><td>&nbsp;</td><td>&nbsp;</td><td> 
              <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl08_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fAIS%2fDZA%2f13691&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
             </td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_AIS_DZA_13691_E.doc</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/AIS/DZA/13691</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td> 
     </tr><tr class="rgRow InnerAlernatingItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__3"> 
      <td>Annex II to ALGERIA's report</td><td>Annex to State party report</td><td>CERD</td><td>Algeria</td><td>&nbsp;</td><td>&nbsp;</td><td> 
              <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl10_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fAIS%2fDZA%2f13692&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
             </td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_AIS_DZA_13692_E.doc</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/AIS/DZA/13692</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td> 
     </tr><tr class="rgRow InnerItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__4"> 
      <td>Annex III to ALGERIA's report</td><td>Annex to State party report</td><td>CERD</td><td>Algeria</td><td>&nbsp;</td><td>&nbsp;</td><td> 
              <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl12_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fAIS%2fDZA%2f13693&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
             </td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_AIS_DZA_13693_E.doc</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/AIS/DZA/13693</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td> 
     </tr><tr class="rgRow InnerAlernatingItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__5"> 
      <td>CERD-C-NZ-18-20_Annexes</td><td>Annex to State party report</td><td>CERD</td><td>New Zealand</td><td>&nbsp;</td><td>&nbsp;</td><td> 
              <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl14_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fADR%2fNZL%2f13731&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
             </td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_ADR_NZL_13731_E.doc</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/ADR/NZL/13731</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td> 
     </tr><tr class="rgRow InnerItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__6"> 
      <td>CERD.C.RUS.20-22_Annex1</td><td>Annex to State party report</td><td>CERD</td><td>Russian Federation</td><td>&nbsp;</td><td>&nbsp;</td><td> 
              <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl16_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fADR%2fRUS%2f13732&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
             </td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">R</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_ADR_RUS_13732_R.doc</td><td style="display:none;">INT/CERD/ADR/RUS/13732</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td> 
     </tr><tr class="rgRow InnerAlernatingItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__7"> 
      <td>Annex to State party report</td><td>Annex to State party report</td><td>CERD</td><td>Poland</td><td>&nbsp;</td><td>&nbsp;</td><td> 
              <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl18_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fADR%2fPOL%2f15432&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
             </td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_ADR_POL_15432_E.doc</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/ADR/POL/15432</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td> 
     </tr><tr class="rgRow InnerItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__8"> 
      <td>Annexe X</td><td>Annex to State party report</td><td>CERD</td><td>Belgium</td><td>&nbsp;</td><td>&nbsp;</td><td> 
              <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl20_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fADR%2fBEL%2f15561&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
             </td><td style="display:none;">&nbsp;</td><td style="display:none;">F</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_ADR_BEL_15561_F.pdf</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/ADR/BEL/15561</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td> 
     </tr><tr class="rgRow InnerAlernatingItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__9"> 
      <td>Annexe XI</td><td>Annex to State party report</td><td>CERD</td><td>Belgium</td><td>&nbsp;</td><td>&nbsp;</td><td> 
              <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl22_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fADR%2fBEL%2f15562&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
             </td><td style="display:none;">&nbsp;</td><td style="display:none;">F</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_ADR_BEL_15562_F.pdf</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/ADR/BEL/15562</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td> 
</tr> 
</tbody> 

I am trying the following line instead, to see how it goes:

while ($content =~ m/rgRow.+?<td>(.+?)<\/td><td>(.+?)<\/td><td>(.+?)<\/td><td>(.+?)<\/td><td>(.+?)<\/td><td>(.+?)<\/td>/gs) 

The original code is not mine.

+0

Please show the HTML you are trying to parse. In any case, a regex is not the right tool for parsing HTML; why not use an HTML parser?

+5

[Required reading for anyone trying to parse XML/HTML with a regex](http://stackoverflow.com/a/1732454/18157). In short: do not parse HTML/XML with a regex; use a proper parser.

+0

Agreed with the above, but if you do need to do this, how about breaking up that nasty line with 'qr'? It would be far easier to read. – zdim

Answers

0

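I take this question as one of debugging legacy code. (Still, see the end for a parser example.)

The reported problem is that the regex hangs. For me, it exits after the first match. My first suspect is a loose newline; the /s modifier only makes . match a newline. Another suspect is that the rgRow phrase is matched explicitly, yet it is also part of the class attribute of the <tr> tag, so it can be matched under .* as well. Is that a conflict? Finally, the regex looks for each cell explicitly while also using the /g modifier. For reference, this is the regex, to be used with the /sg modifiers from the original code.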

$patt = qr/rgRow.*? 
    <td> (.*?)<\/td> 
    <td.*?>(.*?)<\/td> 
    <td.*?>(.*?)<\/td> 
    <td.*?> .*? <\/td> 
    <td.*?>(.*?)<\/td> 
    <td.*?> <nobr>(.*?)<\/nobr> <\/td> 
/x; 
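The remark about /s can be checked in isolation. The following is a minimal sketch on a made-up two-line string (not taken from the page in the question); it only shows that . refuses to cross a newline unless /s is given.

```perl
use strict;
use warnings;

# Hypothetical sample: a single cell whose content spans two lines.
my $text = "<td>first\nsecond</td>";

# Without /s, '.' does not match "\n", so the cell cannot be matched.
my $without_s = $text =~ /<td>(.*?)<\/td>/  ? 1 : 0;
# With /s, '.' matches "\n" as well.
my $with_s    = $text =~ /<td>(.*?)<\/td>/s ? 1 : 0;

print "without /s: $without_s\n";
print "with /s:    $with_s\n";
```

Note that /s changes nothing else about the pattern; in particular it does not limit how far .*? may expand.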

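Picking through the source character by character is unpleasant, and it usually does not work. We can do this instead: remove the newlines, then capture the contents of the <td> tags into an array. That is exactly what the regex was meant to achieve. (I changed the regex delimiters to keep the editor's syntax coloring from breaking.)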

use warnings; 
use strict; 

# Placeholder: $msg stands for the page text pulled from the URL 
my $msg = 'pulled_from_url'; 
(my $msg_nonl = $msg) =~ s%\n%%g; 

# Capture the content of every <td> cell (the leading m is required with | delimiters) 
my @raw_cells = $msg_nonl =~ m|<td.*?>(.*?)<\/td>|g; 

# While we are at it: strip <nobr>, &nbsp;, drop empty elements 
my @cells = grep { !/^\s*$/ } map { s%<\/?nobr>|&nbsp;%%g; $_ } @raw_cells; 
# Drop the cells holding links ("View document") as well 
my @content = grep { !/<a.*?\/a>/ } @cells; 
print "Total of " . scalar(@raw_cells) . " cells. "; 
print "Cleaned up, down to " . scalar(@content) . " cells.\n"; 
print "$_\n" for @content; 

This prints the contents of the cells, abridged here for space:

 
Total of 280 cells. Cleaned up, down to 82 cells. 
CONSIDERATION OF REPORTS SUBMITTED BY ... DUE IN 1974/MOROCCO 
State party's report 
... 
21 Feb 1974 
... 
True 
CONSIDERATION OF REPORTS SUBMITTED BY ... DUE IN 1972/MOROCCO 
State party's report 
... 
17 Jan 1972 
... 
True 

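By inspection, we can see that the content is pulled from the HTML correctly.

I do not mean to judge the poster's motives or constraints. Still, I cannot help comparing the guesswork and careful source reading above with the following.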

use HTML::TableExtract; 
my $te = HTML::TableExtract->new(keep_html => 1); 
$te->parse("<table> " . $msg . "</table>"); 
# We have one table, use top-level 'rows()' shorthand method 
foreach my $row ($te->rows) { 
    print join(',', @$row), "\n"; 
} 

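This reports the same 280 cells (when a count is added) and prints the same rows as the step above. I only had to glance at the source to see that it was missing the <table> tag. HTML::TableExtract is a subclass of HTML::Parser.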

0

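Your regex requires the sixth column to contain <nobr>...</nobr> tags, which only happens in the first two rows. It hangs after that because non-greedy quantifiers can only do so much. When no match is possible, they are just as prone to catastrophic backtracking as the greedy variety.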
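One way to confirm this diagnosis without running the slow pattern is to look at each row on its own. This is only a sketch: the two rows below are abbreviated stand-ins for $content, and the row-splitting regex is an assumption made for illustration, not part of the original code.

```perl
use strict;
use warnings;

# Abbreviated stand-in for $content: one row with <nobr> in the sixth cell, one without.
my $content = join '',
    '<tr class="rgRow"><td>A</td><td>B</td><td>C</td><td>D</td><td>E</td><td><nobr>21 Feb 1974</nobr></td></tr>',
    '<tr class="rgRow"><td>A</td><td>B</td><td>C</td><td>D</td><td>E</td><td>&nbsp;</td></tr>';

# Split into rows first, so that each check is confined to a single row.
my @rows = $content =~ m{<tr\b[^<>]*>(.*?)</tr>}sg;

# Record, per row, whether a <nobr> tag is present at all.
my @has_nobr = map { /<nobr>/ ? 1 : 0 } @rows;
print "row $_: ", ($has_nobr[$_] ? "has" : "lacks"), " <nobr>\n" for 0 .. $#rows;
```

Run against the real page, any row reported as lacking <nobr> is one where the original pattern cannot match within that row, and so goes searching through everything after it.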

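Instead of relying on .*? all the time, try to be specific about what you want to match. In this case it is easy: the TDs you are matching do not contain other tags, so you can use [^<>]* to capture their content. In fact, you should use it everywhere you are currently using .*?

In the regex below I have also made the NOBR tags optional, and I extended it to match the whole opening TR tag, mostly for the sake of readability.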
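The difference is easy to see on a small fragment. The three-cell string below is hypothetical, chosen so that only the last cell contains <nobr>; the sketch compares what each style captures.

```perl
use strict;
use warnings;

# Hypothetical three-cell fragment; only the last cell contains <nobr>.
my $row = '<td>A</td><td>B</td><td><nobr>C</nobr></td>';

# Lazy dot: free to run across tag boundaries until the rest of the pattern fits.
my ($lazy)    = $row =~ m{<td>(.*?)</td><td><nobr>};
# Negated class: can never leave the cell it started in.
my ($negated) = $row =~ m{<td>([^<>]*)</td><td><nobr>};

print "lazy:    $lazy\n";     # lazy:    A</td><td>B
print "negated: $negated\n";  # negated: B
```

The lazy version still finds a match, but by swallowing a cell boundary; on a long page that same freedom is what lets the engine try an enormous number of combinations before giving up.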

while ($content =~ 
    m!<tr\s+class="rgRow[^<>]*>\s* 
    <td[^<>]*>([^<>]*)</td> 
    <td[^<>]*>([^<>]*)</td> 
    <td[^<>]*>([^<>]*)</td> 
    <td[^<>]*>[^<>]*</td> 
    <td[^<>]*>([^<>]*)</td> 
    <td[^<>]*>(?:<nobr>)?([^<>]*)(?:</nobr>)?</td> 
    !sxg) { 
    # do stuff 
}
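As a quick check, the pattern above can be run against a shortened sample. The two rows here are abbreviated from the HTML in the question (the ids, the first title, and the trailing cells are simplified), and the @found array is introduced only for illustration.

```perl
use strict;
use warnings;

# Two abbreviated rows: the first has <nobr> in the sixth cell, the second does not.
my $content = <<'HTML';
<tr class="rgRow InnerItemStyle" id="r0">
 <td>Report title</td><td>State party's report</td><td>CERD</td><td>Morocco</td><td>CERD/C/R.65/Add.1</td><td><nobr>21 Feb 1974</nobr></td><td>
 </td>
</tr><tr class="rgRow InnerAlernatingItemStyle" id="r2">
 <td>Annex I to ALGERIA's Report</td><td>Annex to State party report</td><td>CERD</td><td>Algeria</td><td>&nbsp;</td><td>&nbsp;</td><td>
 </td>
</tr>
HTML

my @found;
while ($content =~
    m!<tr\s+class="rgRow[^<>]*>\s*
    <td[^<>]*>([^<>]*)</td>
    <td[^<>]*>([^<>]*)</td>
    <td[^<>]*>([^<>]*)</td>
    <td[^<>]*>[^<>]*</td>
    <td[^<>]*>([^<>]*)</td>
    <td[^<>]*>(?:<nobr>)?([^<>]*)(?:</nobr>)?</td>
    !sxg) {
    # Keep the five captured cells of each matching row.
    push @found, [$1, $2, $3, $4, $5];
}

print join('|', @$_), "\n" for @found;
```

Both rows match, and for the row without <nobr> the last capture is simply the raw cell text (&nbsp; here), which can be cleaned up afterwards.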