How can I access the content of a JavaScript-driven web page with Perl?

I am trying to write a small Perl application that fetches League of Legends summoner names from LolKing.

The page's HTML has lines like

<tr data-summonername="MatLife TriHard" class="lb_row_rank_4">

so I just have something like

use strict;
use warnings;

use LWP::Simple;
use HTML::Parser;

# Print the data-summonername attribute of every <tr> start tag that has one
my $find_links = HTML::Parser->new(
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'tr' and exists $attr->{'data-summonername'}) {
                print "$attr->{'data-summonername'}\n";
            }
        },
        "tag, attr"
    ]
);

my $html = get('http://www.lolking.net/leaderboards/#/na/1') or die 'nope';

$find_links->parse($html);

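But this gives me nothing. Even if I test for the class attribute instead, it still gives me nothing. For some reason I cannot get at the class of the tr elements.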

Using $attr->{data-summonername} without the single quotes gives me errors, because of the hyphen I think. If I use $attr->{href} it works fine.
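The hyphen is indeed the cause of those errors. Perl auto-quotes a hash subscript only when it is a simple identifier, so {href} means the string 'href', while {data-summonername} is parsed as the expression data - summonername, which use strict rejects. A minimal sketch of the difference follows; the attribute hash and its href value are hypothetical stand-ins for what HTML::Parser passes to a start handler.

```perl
use strict;
use warnings;

# Hypothetical attribute hash, shaped like the one a start handler receives
my $attr = {
    href                => '/summoner/na/1',
    'data-summonername' => 'MatLife TriHard',
};

# A plain identifier is auto-quoted inside the braces
print "$attr->{href}\n";

# A key containing a hyphen must be quoted explicitly
print "$attr->{'data-summonername'}\n";

# Unquoted, the subscript is the expression  data - summonername,
# which fails to compile under "use strict". A string eval traps the
# compile error so the rest of the script keeps running.
my $ok = eval q{ my $v = $attr->{data-summonername}; 1 };
print "unquoted key compiled: ", ($ok ? 'yes' : 'no'), "\n";
```

The quoting question is separate from the empty result: the handler in the question already quotes the key correctly.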

Can anybody help me?

Source

2015-03-19 TheOne

Shameless plug: on Windows, you can [get the web page content using Internet Explorer](http://perltricks.com/article/139/2014/12/11/Automated-Internet-Explorer-screenshots-using-Win32-OLE) and then use [HTML::TableExtract](http://www.nu42.com/2012/04/htmltableextract-is-beautiful.html) to extract the information you need. If you are not on Windows, [get the page content via Firefox](http://perltricks.com/article/138/2014/12/8/Controlling-Firefox-from-Perl) and then use HTML::TableExtract. Of course, there is also [PhantomJS](http://phantomjs.org/). – 2015-03-19 12:02:20

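The problem is that most of the page's HTML is built by your browser, using JavaScript, after the page has been downloaded. Using LWP::Simple::get retrieves only the skeleton HTML and the JavaScript code. You can see that if you print $html instead of parsing it.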
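The effect can be reproduced offline. The sketch below runs the same kind of HTML::Parser handler over two hand-written strings: one standing in for the skeleton that a plain HTTP fetch returns, the other for the document after the browser has run the script. Both strings and the helper name summoner_names are hypothetical and are not taken from the real site.

```perl
use strict;
use warnings;

use HTML::Parser;

# Collect every data-summonername attribute found on a <tr> start tag
sub summoner_names {
    my ($html) = @_;
    my @names;
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ($tag, $attr) = @_;
                push @names, $attr->{'data-summonername'}
                    if $tag eq 'tr' and exists $attr->{'data-summonername'};
            },
            'tagname, attr',
        ],
    );
    $parser->parse($html);
    $parser->eof;
    return @names;
}

# Hypothetical skeleton: an empty table plus the script that would fill it
my $skeleton = '<html><body><table id="board"></table>'
    . '<script src="app.js"></script></body></html>';

# Hypothetical document after the browser has executed the script
my $rendered = '<html><body><table id="board">'
    . '<tr data-summonername="MatLife TriHard" class="lb_row_rank_4"></tr>'
    . '</table></body></html>';

my @from_skeleton = summoner_names($skeleton);
my @from_rendered = summoner_names($rendered);

printf "skeleton: %d row(s)\n", scalar @from_skeleton;
printf "rendered: %d row(s): %s\n", scalar @from_rendered, join ', ', @from_rendered;
```

The parser is not at fault: it finds the row as soon as the row exists in the text it is given. What is missing from the fetched page is the rows themselves.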

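The usual solution is to use WWW::Mechanize::Firefox, which gets an installed Firefox to download and build the page so that you can then query it. It is much more complex than a simple get, though: you have to install Firefox if you don't already have it, along with the Mozilla MozRepl plugin that enables remote control. Even then you may have trouble accessing the page content before the browser has finished building it, so this is not for the faint of heart.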

Update

For your benefit, here is a solution using WWW::Mechanize::Firefox.

use strict;
use warnings;

use WWW::Mechanize::Firefox;
use HTML::TreeBuilder::XPath;

my $url = 'http://www.lolking.net/leaderboards/#/na/1';

# Drive the installed Firefox (through MozRepl) so the page's JavaScript runs
my $mech = WWW::Mechanize::Firefox->new;
my $resp = $mech->get($url);
die $resp->status_line unless $resp->is_success;

my $tree = HTML::TreeBuilder::XPath->new_from_content($resp->content);

# Each leaderboard row has a class such as lb_row_rank_4; the digits are the rank
for my $node ($tree->findnodes('//tr[starts-with(@class, "lb_row_rank")]')) {
    printf "Rank %2d: %s\n",
        $node->attr('class') =~ /(\d+)/,
        $node->attr('data-summonername');
}

Output

Rank  1: Doublelift
Rank  2: F5 Veritas
Rank  3: Life Love Live
Rank  4: MatLife TriHard
Rank  5: TDK Kyle
Rank  6: Liquid FeniX
Rank  7: Liquid Inori TV
Rank  8: dawoofsclaw
Rank  9: who is he
Rank 10: Ohhhq

Source

2015-03-19 11:39:20 Borodin
