
I want to follow all the links on a web page that point to PDF files and save those PDF files on my system, in a web crawler built with Scrapy where one spider calls another spider.

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 
from scrapy.http import Request 
from bs4 import BeautifulSoup 


class spider_a(BaseSpider): 
    name = "Colleges" 
    allowed_domains = ["www.abc.org"]  # domain only, without the http:// scheme
    start_urls = [ 
        "http://www.abc.org/appwebsite.html", 
        "http://www.abc.org/misappengineering.htm", 
    ] 

    def parse(self, response): 
        # parse the page with BeautifulSoup and print every link containing ".pdf"
        soup = BeautifulSoup(response.body) 
        for link in soup.find_all('a'): 
            download_link = link.get('href') 
            if download_link and '.pdf' in download_link: 
                pdf_url = "http://www.abc.org/" + download_link 
                print pdf_url 

With the code above I am able to find the links to the PDF files on the pages I expect.

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 


class FileSpider(BaseSpider): 
    name = "fspider" 
    allowed_domains = ["www.aicte-india.org"] 
    start_urls = [ 
        "http://www.abc.org/downloads/approved_institut_websites/an.pdf#toolbar=0&zoom=85" 
    ] 

    def parse(self, response): 
        # save the response body under the last segment of the URL
        filename = response.url.split("/")[-1] 
        open(filename, 'wb').write(response.body) 

With this code I can save the bodies of the pages listed in start_urls to disk.

Is there a way to combine these two spiders so that the PDF files get saved just by running my crawler?

Answer


Why do you need two spiders?

from urlparse import urljoin 

from scrapy.spider import BaseSpider 
from scrapy.http import Request 
from scrapy.selector import HtmlXPathSelector 

class spider_a(BaseSpider): 
    ... 
    def parse(self, response): 
        hxs = HtmlXPathSelector(response) 
        # select() returns selectors; extract() turns them into plain strings
        for href in hxs.select('//a/@href[contains(.,".pdf")]').extract(): 
            # resolve relative links and hand each PDF response to save_file
            yield Request(urljoin(response.url, href), 
                          callback=self.save_file) 

    def save_file(self, response): 
        filename = response.url.split("/")[-1] 
        with open(filename, 'wb') as f: 
            f.write(response.body) 
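
Putting the pieces together, i.e. the attributes of spider_a from the question plus the parse/save_file callbacks from the answer, the complete single spider would look roughly like the sketch below. It is assembled only from the code already shown here; the PDFs are written to the directory the crawl is started from.

from urlparse import urljoin 

from scrapy.spider import BaseSpider 
from scrapy.http import Request 
from scrapy.selector import HtmlXPathSelector 


class spider_a(BaseSpider): 
    name = "Colleges" 
    allowed_domains = ["www.abc.org"] 
    start_urls = [ 
        "http://www.abc.org/appwebsite.html", 
        "http://www.abc.org/misappengineering.htm", 
    ] 

    def parse(self, response): 
        # find every href containing ".pdf" and request it
        hxs = HtmlXPathSelector(response) 
        for href in hxs.select('//a/@href[contains(.,".pdf")]').extract(): 
            yield Request(urljoin(response.url, href), callback=self.save_file) 

    def save_file(self, response): 
        # name each file after the last segment of its URL
        filename = response.url.split("/")[-1] 
        with open(filename, 'wb') as f: 
            f.write(response.body) 

Running `scrapy crawl Colleges` then downloads every linked PDF in a single pass.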

Hi @steven, thanks for the help, but I am getting the following error: exceptions.AttributeError: 'HtmlXPathSelector' object has no attribute 'find' – user2253803 2013-04-23 04:58:23


That's because you need to use 'select', not 'find'... and if you are using Scrapy, you don't need Beautiful Soup. – 2013-04-23 13:41:49
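
To make the comment concrete: the old HtmlXPathSelector API used in the answer exposes select() and extract(), not BeautifulSoup-style find(). A minimal standalone sketch, using a made-up HTML snippet and the question's page URL purely for illustration:

from scrapy.http import HtmlResponse 
from scrapy.selector import HtmlXPathSelector 

# illustrative page body; in a real spider the response comes from the crawl
body = '<html><body><a href="files/report.pdf">report</a></body></html>' 
response = HtmlResponse(url="http://www.abc.org/appwebsite.html", 
                        body=body, encoding='utf-8') 

hxs = HtmlXPathSelector(response) 
print hxs.select('//a/@href').extract()  # ['files/report.pdf'] 
# hxs.find('a') would raise AttributeError: no attribute 'find' 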