2015-11-19

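HTML::TagFilter returning HTML::Element HASH object

Perl newbie here. I am enjoying what I can do with the language, and the support and documentation for all these great libraries; however, I am having an issue with a script I am working on. Before implementing HTML::TagFilter, I used line 63 (the now commented-out `print FH $tree->as_HTML`) to write the HTML content I was looking for to a file. I specifically looked for everything inside the body tag. Now I would like to print only the p tags, h tags, and img tags, without any attributes. When I run my code, the files are created in the proper directory, but each file contains a printed hash object (HTML::Element=HASH(0x3a104c8)) instead of the content.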

use open qw(:locale); 
use strict; 
use warnings qw(all); 

use HTML::TreeBuilder 5 -weak; # Ensure weak references in use 
use URI::Split qw/ uri_split uri_join /; 
use HTML::TagFilter; 

my @links; 

open(FH, "<", "index/site-index.txt") 
    or die "Failed to open file: $!\n"; 
while(<FH>) { 
    chomp; 
    push @links, $_; 
} 
close FH; 

my $dir = ""; 
while($dir eq ""){ 
print "What is the name of the site we are working on? "; 
$dir = <STDIN>; 
chomp $dir; 
} 

#make directory to store files 
mkdir($dir); 

my $entities = ""; 
my $indent_char = "\t"; 
my $filter = HTML::TagFilter->new(
    allow => {
        p   => { none => [] },
        h1  => { none => [] },
        h2  => { none => [] },
        h3  => { none => [] },
        h4  => { none => [] },
        h5  => { none => [] },
        h6  => { none => [] },
        img => { none => [] },
    },
    log_rejects    => 1,
    strip_comments => 1,
);

foreach my $url (@links){ 

    #print $url; 

    my ($filename) = $url =~ m#([^/]+)$#; 

    #print $filename; 
    $filename =~ tr/=/_/; 
    $filename =~ tr/?/_/; 
    #print "\n"; 

    my $currentfile = $dir . '/' . $filename . '.html'; 

    print "Preparing " . $currentfile . "\n" . "\n"; 

    open (FH, '>', $currentfile) 
     or die "Failed to open file: $!\n"; 


    my $tree = HTML::TreeBuilder->new_from_url($url); 
    $tree->parse($url); 
    $tree = $tree->look_down('_tag', 'body'); 
    if($tree){ 
     $tree->dump; # a method we inherit from HTML::Element 
     print FH $filter->filter($tree); 
     #print FH $tree->as_HTML($entities, $indent_char), "\n"; 
    } else{ 
     warn "No body tag found"; 
    } 

    print "File " . $currentfile . " completed.\n" . "\n"; 

    close FH; 

} 

Why is this happening, and how can I print the actual content I am looking for?

Thank you.

Answer


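$filter->filter() expects HTML as a string. An HTML::TreeBuilder object is not HTML; it is a subclass of HTML::Element, and look_down() returns an HTML::Element. That is what you are seeing in your output: when you treat an object reference as a string, you get the string representation of the object. HTML::Element=HASH(0x7f81509ab6d8) means an object of class HTML::Element, which is implemented as a HASH structure, followed by the memory address of that object.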
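The stringification behaviour described above can be reproduced with core Perl alone. The following is only a minimal sketch: `My::Node` and its `as_text` method are hypothetical stand-ins invented for illustration and are not part of HTML::Element.

```perl
use strict;
use warnings;

# Hypothetical stand-in class (an assumption for illustration only);
# like HTML::Element, it is a blessed hash reference.
package My::Node;

sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}

# Accessor that returns the data stored inside the object.
sub as_text {
    my ($self) = @_;
    return $self->{text};
}

package main;

my $node = My::Node->new(text => 'hello');

# Using the object in string context gives "Class=HASH(0x...)",
# not the content it holds.
my $stringified = "$node";
print "$stringified\n";

# Calling a method is what returns the actual content.
my $content = $node->as_text;
print "$content\n";
```

The first line printed has the form `My::Node=HASH(0x...)` with an address that varies between runs; the second is the stored text.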

You can fix this by passing the filter the HTML of the element returned by look_down():

  print FH $filter->filter($tree->as_HTML); 
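For context, here is a small self-contained sketch of the same fix that parses an in-memory string instead of fetching a URL. The sample markup is made up for illustration, and the `allow` rules are simply copied from the question (trimmed to two tags), so treat the exact filtered output as something to check on your own installation.

```perl
use strict;
use warnings;

use HTML::TreeBuilder;
use HTML::TagFilter;

# Made-up sample input (an assumption; the question fetches real URLs).
my $html = '<html><body><h1 class="t">Title</h1>'
         . '<p id="x">Text</p><div>Other</div></body></html>';

# Same style of rules as in the question, trimmed to two tags.
my $filter = HTML::TagFilter->new(
    allow          => { p => { none => [] }, h1 => { none => [] } },
    strip_comments => 1,
);

my $tree = HTML::TreeBuilder->new_from_content($html);
my $body = $tree->look_down(_tag => 'body');

# Serialize the element to an HTML string first, then filter the string.
my $filtered = $filter->filter($body->as_HTML);
print "$filtered\n";

# Release the tree explicitly (harmless if weak references are enabled).
$tree->delete;
```

The key point is that filter() receives the string produced by as_HTML, not the HTML::Element object itself.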

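Awesome! Thank you so much for your help! I knew there was something more I needed to do. Now I know why the object itself was being printed instead of the content. Thanks again!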