Some help with HTML parsing in Perl

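I'm working on a small Perl program that opens a website, searches for the words Hail Reports, and gives the information back to me. I'm very new to Perl, so this may be easy to fix. First, my code says I'm using an uninitialized value. Here is what I have: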

#!/usr/bin/perl -w 
use LWP::Simple; 

my $html = get("http://www.spc.noaa.gov/climo/reports/last3hours.html") 
    or die "Could not fetch NWS page."; 
$html =~ m{Hail Reports} || die; 
my $hail = $1; 
print "$hail\n";

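Second, I think a regex would be the easiest way to do what I want, but I'm not sure whether it can be done with one. I want my program to search for Hail Reports and send me back the information between the words Hail Reports and Wind Reports. Is this possible with a regex, or should I use a different approach? Here is a snippet of the page source that I want it to send back in $1: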

 <tr><th colspan="8">Hail Reports (<a href="last3hours_hail.csv">CSV</a>)&nbsp;(<a href="last3hours_raw_hail.csv">Raw Hail CSV</a>)(<a href="/faq/#6.10">?</a>)</th></tr> 

#The Data here will change throughout the day so normally there will be more info. 
     <tr><td colspan="8" class="highlight" align="center">No reports received</td></tr> 
     <tr><th colspan="8">Wind Reports (<a href="last3hours_wind.csv">CSV</a>)&nbsp;(<a href="last3hours_raw_wind.csv">Raw Wind CSV</a>)(<a href="/faq/#6.10">?</a>)</th></tr>


2010-07-02 shinjuo

Could you try using XPath? – 2010-07-02 19:46:31

You're not capturing anything in $1, because no part of your regex is enclosed in parentheses. The following works for me.

#!/usr/bin/perl 
use strict; 
use warnings; 

use LWP::Simple; 

my $html = get("http://www.spc.noaa.gov/climo/reports/last3hours.html") 
    or die "Could not fetch NWS page."; 

$html =~ m{Hail Reports(.*)Wind Reports}s || die; #Parentheses indicate capture group 
my $hail = $1; # $1 contains whatever matched in the (.*) part of above regex 
print "$hail\n";


2010-07-02 19:56:44 d5e5

Thanks, that covers both questions nicely. – shinjuo 2010-07-02 20:02:07

Parentheses capture strings in a regular expression. Your regex has no parentheses, so $1 is never set. If you had:

$html =~ m{(Hail Reports)} || die;

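then $1 would be set to "Hail Reports" if that text exists in the $html variable. Since you only want to know whether it matched, you don't really need to capture anything at this point, and you could write it like this: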

unless ($html =~ /Hail Reports/) { 
    die "No Hail Reports in HTML"; 
}

To capture what lies between the two strings, you could do something like:

if ($html =~ /(?<=Hail Reports)(.*?)(?=Wind Reports)/s) { 
    print "Got $1\n"; 
}
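As a rough offline illustration of the points above, the sketch below runs the same kind of pattern against a hard-coded string instead of the live page. The `$html` contents are a hypothetical, shortened stand-in for the real page source, and the variable names are made up for this example. It shows that the capture only succeeds once the `/s` modifier lets `.` match newlines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample standing in for the downloaded page (not the real source).
my $html = join "\n",
    '<tr><th colspan="8">Hail Reports</th></tr>',
    '<tr><td colspan="8">No reports received</td></tr>',
    '<tr><th colspan="8">Wind Reports</th></tr>';

# Without /s, "." does not match a newline, so the pattern cannot span lines.
my $without_s = $html =~ /Hail Reports(.*?)Wind Reports/ ? 1 : 0;

# With /s, "." also matches newlines and the capture group is filled.
my $with_s  = $html =~ /Hail Reports(.*?)Wind Reports/s ? 1 : 0;
my $between = $with_s ? $1 : '';

print "matched without /s: $without_s\n";
print "matched with /s:    $with_s\n";
print "captured text:\n$between\n";
```

Checking the match result before reading `$1`, as done here, avoids the uninitialized-value warning from the question.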


2010-07-02 19:57:06 runrig

You need the 's' modifier on the regex so that it matches newlines, i.e. =~ /.../s – 2010-07-02 20:02:50

Thanks. Updated. – runrig 2010-07-02 20:05:25

The uninitialized value warning comes from $1: it is never defined or set anywhere.

For a line-level "between" rather than a byte-level one, you can use:

for (split(/\n/, $html)) { 
    print "$_\n" if (/Hail Reports/ .. /Wind Reports/ and !/(?:Hail|Wind) Reports/);
}
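The line-level approach relies on Perl's range operator acting as a flip-flop in scalar context: it becomes true on the line matching the left pattern and stays true through the line matching the right pattern. Here is a minimal sketch on made-up input; the sample lines and the `@kept` array are assumptions for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input; the real page has HTML table rows instead.
my $html = join "\n",
    'header row',
    'Hail Reports (CSV)',
    'first hail row',
    'second hail row',
    'Wind Reports (CSV)',
    'first wind row';

my @kept;
for my $line (split /\n/, $html) {
    # Flip-flop: true from the Hail line through the Wind line, false elsewhere.
    next unless $line =~ /Hail Reports/ .. $line =~ /Wind Reports/;

    # Drop the two marker lines themselves so only the rows in between remain.
    next if $line =~ /(?:Hail|Wind) Reports/;

    push @kept, $line;
}

print "$_\n" for @kept;
```

Collecting the lines in an array rather than printing them directly makes it easy to check whether anything was found before producing output.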


2010-07-02 20:03:32

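Make use of single-line and multi-line matching. Also, this only picks up the first match for the text in between, which will be a bit faster than a greedy match.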

#!/usr/bin/perl -w 

use strict; 
use LWP::Simple; 

    sub main{ 
     my $html = get("http://www.spc.noaa.gov/climo/reports/last3hours.html") 
       or die "Could not fetch NWS page."; 

     # match single and multiple lines + not greedy 
     my ($hail, $between, $wind) = $html =~ m/(Hail Reports)(.*?)(Wind Reports)/sm 
       or die "No Hail/Wind Reports"; 

     print qq{ 
       Hail:   $hail 
       Wind:   $wind 
       Between Text: $between 
      }; 
    } 

    main();
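To make the greedy versus non-greedy difference concrete, here is a small sketch on a made-up string that contains the closing marker twice. The sample text and variable names are hypothetical. Note that `/m` only changes how `^` and `$` behave, so for a pattern without those anchors it is the `/s` modifier that matters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical text with two occurrences of the closing marker.
my $text = "Hail Reports A Wind Reports B Wind Reports";

# Non-greedy: stops at the first "Wind Reports" after the opening marker.
my ($lazy) = $text =~ /Hail Reports(.*?)Wind Reports/s;

# Greedy: runs on to the last "Wind Reports" in the string.
my ($greedy) = $text =~ /Hail Reports(.*)Wind Reports/s;

print "lazy:   [$lazy]\n";    # prints "lazy:   [ A ]"
print "greedy: [$greedy]\n";  # prints "greedy: [ A Wind Reports B ]"
```

On the real page the non-greedy form is the safer choice if the closing marker could appear more than once.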


2010-07-03 00:48:14
