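I have been able to scrape data from websites with a simple HTML structure, using the Perl module Web::Scraper to retrieve data from various tags. However, I have run into a data- attribute that I cannot handle in the usual way. How can I get the value of a data attribute using Web::Scraper?
The tag is: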
<img class="slide_image"
src="https://image.slidesharecdn.com/computerassistedsurgery-160629113952/95/computer-assisted-surgery-1-638.jpg?cb=1467200461"
data-small="https://image.slidesharecdn.com/computerassistedsurgery-160629113952/85/computer-assisted-surgery-1-320.jpg?cb=1467200461"
data-normal="https://image.slidesharecdn.com/computerassistedsurgery-160629113952/95/computer-assisted-surgery-1-638.jpg?cb=1467200461"
data-full="https://image.slidesharecdn.com/computerassistedsurgery-160629113952/95/computer-assisted-surgery-1-1024.jpg?cb=1467200461"
alt="COMPUTER ASSISTED SURGERY Something ">
The part I need is the value of the data-full attribute, i.e. "https://image.slidesharecdn.com/computerassistedsurgery-160629113952/95/computer-assisted-surgery-1-1024.jpg?cb=1467200461".
My current code is:
use strict;
use warnings;
use lib "lib";
use URI;
use Web::Scraper;
use YAML;
use WWW::Mechanize;
use URI::Encode;
use HTTP::Cookies;
use LWP::UserAgent;
use Data::Dumper;
my $purlToScrape='https://www.slideshare.net/drdeepashivnani/computer-assisted-surgery?from_m_app=android';
print "Scraping $purlToScrape\n";
my $noticescr = scraper {
process 'section>img', 'link[]' => 'TEXT';
};
my $notices = $noticescr->scrape(URI->new($purlToScrape));
print Dumper($notices);
This fails with the error:
Don't know what to do with 0 => undef at /usr/local/share/perl/5.20.2/Web/Scraper.pm line 150.
How can I solve this?
Those are attributes, not tags. – simbabque
Right. Pardon the typo. – Droidzone
You are loading a lot of modules you don't need. – simbabque
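One possible approach, offered as a minimal sketch rather than a definitive fix: Web::Scraper accepts '@name' as the value spec in process, which selects the named attribute instead of the element's text (an img element has no text content, so 'TEXT' would not yield the URL in any case). The sketch below loads only Web::Scraper and scrapes an inline HTML string that stands in for the fetched page. The wrapping section element and the img.slide_image selector are assumptions based on the tag and selector shown in the question.

```perl
use strict;
use warnings;
use Web::Scraper;

# Inline stand-in for the fetched page (assumption: the real page wraps
# the image in a <section>, as the selector in the question suggests).
my $html = <<'HTML';
<section>
<img class="slide_image"
 src="https://image.slidesharecdn.com/computerassistedsurgery-160629113952/95/computer-assisted-surgery-1-638.jpg?cb=1467200461"
 data-full="https://image.slidesharecdn.com/computerassistedsurgery-160629113952/95/computer-assisted-surgery-1-1024.jpg?cb=1467200461"
 alt="COMPUTER ASSISTED SURGERY Something ">
</section>
HTML

my $noticescr = scraper {
    # '@data-full' asks for the attribute value; 'link[]' collects
    # one entry per matching <img> into an array reference.
    process 'img.slide_image', 'link[]' => '@data-full';
};

# scrape() also accepts a URI object, as in the question's code.
my $notices = $noticescr->scrape($html);

print "$_\n" for @{ $notices->{link} || [] };
```

To run this against the live page, pass URI->new($purlToScrape) to scrape instead of $html. Whether the selector still matches depends on the page's current markup.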