How do I extract Amazon reviews from HTML?

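I have been trying to write a Perl script to scrape Amazon and download product reviews, but I haven't been able to get it working. I am using the Perl modules LWP::Simple and HTML::TreeBuilder::XPath. How do I extract Amazon reviews from the HTML?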

Given this HTML

<div id="revData-dpReviewsMostHelpfulAUI-R1GQHD9GMGBDXP" class="a-row a-spacing-small"> 
    <span class="a-size-mini a-color-state a-text-bold"> 
    Verified Purchase 
    </span> 
    <div class="a-section"> 
    I bought this to replace an earlier model that got lost in transit when we moved. It is a real handy helper to have when making tortillas. Follow the recipe for flour tortillas in the little recipe book that comes with it. I make a few changes 

    </div> 
</div> 

</div> 
</div>

I want to extract the text of the product review. For this I wrote:

use LWP::Simple; 

#use HTML::TreeBuilder; 
use HTML::TreeBuilder::XPath; 

# Take the ASIN from the command line. 
my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n"; 

# Assemble the URL from the passed ASIN. 
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews"; 

# Set up unescape-HTML rules. Quicker than URI::Escape. 
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' '); 
my $unescape_re = join '|' => keys %unescape; 

# Request the URL. 
my $content = get($url); 
die "Could not retrieve $url" unless $content; 
my $tree = HTML::TreeBuilder::XPath->new_from_content($content); 
my @data = $tree->findvalues('div[@class ="a-section"]'); 

foreach (@data) 
{ 
    print "$_\n"; 
}

But I don't get any output. Can anyone point out my mistake?

Source

2015-04-01 Aakash Sharma

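You should stick with 'uri_unescape' to remove character entities from the HTML. A hash used with a global regex may be faster, but compared with the time it takes to fetch the HTML from the internet the difference is likely to be negligible. And 'uri_unescape' is much more concise and self-documenting. – Borodin 2015-04-01 13:14:34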
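A note on the entity decoding discussed in the comment above: as far as I know, URI::Escape's uri_unescape reverses percent-encoding in URLs, while HTML character entities such as &amp;amp; are normally decoded with decode_entities from HTML::Entities. The following is only a minimal sketch; the sample strings are made up for illustration and are not taken from an Amazon page.

```perl
use strict;
use warnings;
use feature 'say';

use HTML::Entities qw(decode_entities);
use URI::Escape qw(uri_unescape);

# Hypothetical review fragment containing HTML character entities.
my $text = 'Salt &amp; pepper &quot;to taste&quot;';

# Called in non-void context, decode_entities returns a decoded copy
# and leaves $text itself untouched.
my $decoded = decode_entities($text);
say $decoded;

# Percent-encoded URL component, the kind of input uri_unescape handles.
my $unescaped = uri_unescape('customer%20reviews');
say $unescaped;
```

Either module replaces the hand-built %unescape hash in the question, and both cover far more cases than the three entities listed there.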

Why scrape Amazon? Did you know that they have a [product API](https://metacpan.org/release/Net-Amazon)? – 2015-04-08 16:04:55

I think the XPath should be '//div[@class="a-section"]' (the extra // at the start of the expression finds the div anywhere in the HTML).

Source

2015-04-01 08:27:25 mirod

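As choroba says, your XPath expression should start with // so that it searches for descendants of type div. At the moment you are searching for <div> elements at the root of the document, and there are none.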
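The difference between the two expressions can be seen on a tiny document. This is a sketch only: the one-line page below is hypothetical markup, not real Amazon HTML, and it assumes HTML::TreeBuilder::XPath is installed.

```perl
use strict;
use warnings;
use feature 'say';

use HTML::TreeBuilder::XPath;

# Hypothetical one-review page, not real Amazon markup.
my $html = '<html><body><div class="a-section">Great tortilla press</div></body></html>';
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Relative path: only considers <div> children of the node the search starts from.
my @relative = $tree->findvalues('div[@class="a-section"]');

# Leading //: considers <div> elements at any depth in the document.
my @anywhere = $tree->findvalues('//div[@class="a-section"]');

say scalar @relative;
say scalar @anywhere;

$tree->delete;
```

The relative form finds nothing here because the only div sits inside body rather than directly under the starting node.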

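You are also looking for a class attribute that is exactly equal to a-section, when in fact the class attribute of each div element can contain several classes, like

class="a-section a-subheader a-breadcrumb celwidget"

and you want to match when any one of them is a-section.

There are several ways around this. The most obvious is to use the XPath contains function to check whether a-section appears anywhere in the class string, like this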

use strict;
use warnings;
use feature 'say';    # needed for say

use LWP::Simple;
use HTML::TreeBuilder::XPath;

my $asin = 'B0031EJBI4';

my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";

# Fetch the page and fail loudly if nothing came back
my $content = get($url) or die "Could not retrieve $url";
my $tree = HTML::TreeBuilder::XPath->new_from_content($content);

my @nodes = $tree->findnodes('//div[contains(@class, "a-section")]');

say scalar @nodes;

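This reports that there are 64 such nodes. That is the right result and you may not want to go any further, but the solution is not a safe one, as it will match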

<div class="aaa-sections">

nodes as well. To fix this properly you need to fall back on the non-XPath HTML::Element method look_down, like this, which insists on a word boundary before and after a-section.

my @nodes = $tree->look_down(
    _tag => 'div', 
    class => qr/\ba-section\b/, 
); 

say scalar @nodes;

Again, the result is the correct 64.

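But even that solution won't cope with classes that start or end with a non-word character, like -section, because /\b-section\b/ will never match such a class. The most general solution is to use a subroutine in the look_down criteria, as below. It splits the class string on whitespace (' ' is correct: don't change it to / / or /\s+/) and builds a %classes hash with all of the substrings as keys. Then the presence of an a-section class is a simple test of $classes{'a-section'}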

@nodes = $tree->look_down(
    _tag => 'div',
    sub {
        return unless my $class = $_[0]->attr('class');
        my %classes = map { $_ => 1 } split ' ', $class;
        $classes{'a-section'};
    }
);

say scalar @nodes;

Once again the result for this page is 64, but this solution will work with any class string.
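For anyone who prefers to stay in XPath, a commonly used XPath 1.0 idiom pads the normalized class string with spaces and searches for the padded class name. The sketch below compares it with the substring test and with a split-based look_down on a small hypothetical document (the markup is invented for illustration, and I am assuming the XPath engine behind HTML::TreeBuilder::XPath provides the standard concat and normalize-space functions).

```perl
use strict;
use warnings;
use feature 'say';

use HTML::TreeBuilder::XPath;

# Hypothetical sample markup, not taken from a real Amazon page.
my $html = <<'HTML';
<html><body>
<div class="a-section">one</div>
<div class="a-row  a-section celwidget">two</div>
<div class="aaa-sections">three</div>
<div class=" a-section ">four</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Substring test: also matches class="aaa-sections".
my @loose = $tree->findnodes('//div[contains(@class, "a-section")]');

# Token test: surround the normalized class list with spaces,
# then look for " a-section " as a whole word.
my @exact = $tree->findnodes(
    '//div[contains(concat(" ", normalize-space(@class), " "), " a-section ")]'
);

# The same token test with look_down, splitting the class list on whitespace.
my @split = $tree->look_down(
    _tag => 'div',
    sub {
        my $class = $_[0]->attr('class');
        return 0 unless defined $class;
        return scalar grep { $_ eq 'a-section' } split ' ', $class;
    },
);

say scalar @loose;
say scalar @exact;
say scalar @split;

$tree->delete;
```

The token test and the split-based test should agree with each other, while the substring test additionally picks up the aaa-sections element.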

Source

2015-04-01 13:03:27 Borodin


use LWP::Simple; 

#use HTML::TreeBuilder; 
use HTML::TreeBuilder::XPath; 

# Take the ASIN from the command line. 
my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n"; 

# Assemble the URL from the passed ASIN. 
my $url = "http://rads.stackoverflow.com/amzn/click/B00R3DO58K"; 

# Set up unescape-HTML rules. Quicker than URI::Escape. 
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' '); 
my $unescape_re = join '|' => keys %unescape; 

# Request the URL. 
my $content = get($url); 



die "Could not retrieve $url" unless $content; 
my $tree = HTML::TreeBuilder::XPath->new_from_content($content); 
my @data = $tree->findvalues('//span[@class="vtp-byline-text"]'); 


#print $content; 

foreach (@data) 
{ 
    print "$_\n"; 
}

Source

2015-04-01 13:14:29

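A little explanation to go with your post would be good. And it has the same problem as the OP's code: it won't find elements that have multiple values in the 'class' attribute. – Borodin 2015-04-01 13:16:59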

Your '@data' array contains only four nodes, with the text '~ Matthew McConaughey ~ Ian McKellen ~ Jennifer Lawrence ~ Ian McKellen'. That is not what the OP had in mind when he asked for reviews! – Borodin 2015-04-01 13:22:40

That was just an example of matching on a span element's attribute. Use '//span[@class="a-size-base review-text"]' and it will give you the list of reviews on the current page. – 2015-04-02 06:04:22
