2014-09-25 125 views

How can I parse HTML using HTML::TreeBuilder?

This is the HTML I want to parse:

[...] 
<div class="item" style="clear:left;"> 
<div class="icon" style="background-image:url(http://nwn2db.com/assets/builder/icons/40x40/is_acidsplash.png);"> 
</div> 
    <h2>Acid Splash</h2> 
    <p>Caster Level(s): Wizard/Sorcerer 0 
    <br />Innate Level: 0 
    <br />School: Conjuration 
    <br />Descriptor(s): Acid 
    <br />Component(s): Verbal, Somatic 
    <br />Range: Medium 
    <br />Area of Effect/Target: Single 
    <br />Duration: Instant 
    <br />Save: None 
    <br />Spell Resistance: Yes 
    <p> 
    You fire a small orb of acid at the target for 1d3 points of acid damage. 
</div> 
[...] 

This is my algorithm:

my $text = ''; 

scan_child($spells); 

print $text, "\n"; 

sub scan_child { 
    my $element = $_[0]; 
    return if ($element->tag eq 'script' or 
               $element->tag eq 'a'); # prune! 
    foreach my $child ($element->content_list) { 
        if (ref $child) { # it's an element 
            scan_child($child); # recurse! 
        } else { # it's a text node! 
            $child =~ s/(.*)\:/\\item \[$1\]/; # itemize 
            $text .= $child; 
            $text .= "\n"; 
        } 
    } 
    return; 
} 

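It picks up the <key> : <value> pattern and prunes junk such as <script> and <a>...</a>. I would like to improve it so that it also captures the <h2>...</h2> title and all the <p>...<p> blocks, so that I can add some LaTeX markup.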

Any clues?

Thanks in advance.

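Maybe you should take a step back and work out what information you want to extract from the page you are scraping, and how you want to store it. If you have a specific schema or data structure in mind, adding it to the question would be helpful. If you just want to extract all the text, then you are already well on your way. – 2014-09-25 20:58:14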

Maybe. It is still not clear to me what HTML::TreeBuilder stores in its nodes. – Daniele 2014-09-25 21:39:22
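
As a rough illustration of what HTML::TreeBuilder keeps in its nodes: the children returned by content_list are either HTML::Element objects (references) or plain strings for text. The sketch below uses a shortened, hypothetical snippet of the spell markup and only inspects one <p>; the variable names are made up for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Shortened, hypothetical sample of the markup from the question.
my $html = '<div class="item"><h2>Acid Splash</h2>'
         . '<p>Range: Medium<br />Save: None</p></div>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context look_down returns the first matching element.
my $p = $tree->look_down(_tag => 'p');

# Each child is either an HTML::Element (a reference) or a plain string.
my @kinds = map { ref $_ ? 'element:' . $_->tag : "text:$_" } $p->content_list;

print "$_\n" for @kinds;

$tree->delete;    # free the tree explicitly
```

Here the text on either side of <br /> shows up as separate string children, which is why the recursive scan in the question sees one text node per key/value line.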

Answers

0

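Because this may be an XY problem...

Mojo::DOM is a somewhat more modern framework for parsing HTML using CSS selectors. The following pulls the P element you want from the document: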

use strict; 
use warnings; 

use Mojo::DOM; 

my $dom = Mojo::DOM->new(do {local $/; <DATA>}); 

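for my $h2 ($dom->find('h2')->each) { 
    next unless $h2->all_text eq 'Acid Splash'; 

    # Get the following P element. These are the method names used by
    # newer Mojolicious releases; older ones called them next_sibling,
    # node and type.
    my $next_p = $h2; 
    while ($next_p = $next_p->next_node) { 
        last if $next_p->type eq 'tag' and $next_p->tag eq 'p'; 
    } 

    print $next_p if $next_p; 
} 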

__DATA__ 
<html> 
<body> 
<div class="item" style="clear:left;"> 
<div class="icon" style="background-image:url(http://nwn2db.com/assets/builder/icons/40x40/is_acidsplash.png);"> 
</div> 
    <h2>Acid Splash</h2> 
    <p>Caster Level(s): Wizard/Sorcerer 0 
    <br />Innate Level: 0 
    <br />School: Conjuration 
    <br />Descriptor(s): Acid 
    <br />Component(s): Verbal, Somatic 
    <br />Range: Medium 
    <br />Area of Effect/Target: Single 
    <br />Duration: Instant 
    <br />Save: None 
    <br />Spell Resistance: Yes 
    <p> 
    You fire a small orb of acid at the target for 1d3 points of acid damage. 
</div> 
</body> 
</html> 

Output:

<p>Caster Level(s): Wizard/Sorcerer 0 
    <br>Innate Level: 0 
    <br>School: Conjuration 
    <br>Descriptor(s): Acid 
    <br>Component(s): Verbal, Somatic 
    <br>Range: Medium 
    <br>Area of Effect/Target: Single 
    <br>Duration: Instant 
    <br>Save: None 
    <br>Spell Resistance: Yes 
    </p> 
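
Since the point of Mojo::DOM is its CSS selectors, the same page can also be walked one div.item at a time. This is only a sketch on a shortened, hypothetical copy of the markup; the @items structure and its keys are invented for the example.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Shortened, hypothetical sample of the markup from the question.
my $html = <<'HTML';
<div class="item">
  <h2>Acid Splash</h2>
  <p>Range: Medium
  <br />Save: None
  <p>
  You fire a small orb of acid at the target.
</div>
HTML

my $dom = Mojo::DOM->new($html);

my @items;
for my $item ($dom->find('div.item')->each) {
    my $h2 = $item->at('h2') or next;    # first <h2> inside this item

    # Text of every <p> in the item; the unclosed first <p> is ended
    # implicitly when the second one starts.
    my @paragraphs = $item->find('p')->map('all_text')->each;

    push @items, { title => $h2->all_text, paragraphs => \@paragraphs };
}

print "$_->{title}: ", scalar @{ $_->{paragraphs} }, " paragraph(s)\n" for @items;
```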
0

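I use the look_down() method to scan the HTML. With look_down() I can first get a list of all the divs with class="item".

Then I can iterate over them, finding and processing the h2 and the p in each one, which I will then split using // as the separator.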
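
A minimal sketch of that approach, under some assumptions: the sample markup is a shortened, hypothetical copy of the question's HTML, the %spell layout (title, fields, text) is invented here, and instead of splitting a string it simply takes the text nodes that sit between the <br /> elements. LaTeX special characters are not escaped.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Shortened, hypothetical sample of the markup from the question.
my $html = <<'HTML';
<div class="item">
  <h2>Acid Splash</h2>
  <p>Range: Medium
  <br />Save: None
  <p>
  You fire a small orb of acid at the target.
</div>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @spells;
for my $item ($tree->look_down(_tag => 'div', class => 'item')) {
    my $h2 = $item->look_down(_tag => 'h2') or next;
    my %spell = (title => $h2->as_trimmed_text, fields => [], text => []);

    for my $p ($item->look_down(_tag => 'p')) {
        # Keep only the plain-text children; <br /> elements are skipped.
        for my $chunk (grep { !ref } $p->content_list) {
            $chunk =~ s/^\s+|\s+$//g;
            next unless length $chunk;
            if ($chunk =~ /^([^:]+):\s*(.*)$/s) {
                push @{ $spell{fields} }, [$1, $2];    # "key: value" line
            } else {
                push @{ $spell{text} }, $chunk;        # free-form description
            }
        }
    }
    push @spells, \%spell;
}
$tree->delete;

# Emit some simple LaTeX markup for each spell.
for my $spell (@spells) {
    printf "\\section*{%s}\n\\begin{description}\n", $spell->{title};
    printf "\\item[%s] %s\n", @$_ for @{ $spell->{fields} };
    print "\\end{description}\n";
    print "$_\n" for @{ $spell->{text} };
}
```

Collecting the data first and printing afterwards keeps the scraping separate from the LaTeX formatting, so either part can be changed on its own.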