2017-07-01

Need help with a YellowPages spider

I'm new to Scrapy and so far I've been able to create a few spiders. I want to write a spider that crawls YellowPages looking for websites that return a 404 response. The spider works, but pagination doesn't. Any help is appreciated. Thanks in advance.

# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    #allowed_domains = ['www.yellowpages.com']
    start_urls = ['https://www.yellowpages.com/search?search_terms=handyman&geo_location_terms=Miami%2C+FL']

    def parse(self, response):
        for listing in response.css('div.search-results.organic div.srp-listing'):
            url = listing.css('a.track-visit-website::attr(href)').extract_first()
            yield scrapy.Request(url=url, callback=self.parse_details)

        # follow pagination links
        next_page_url = response.css('a.next.ajax-page::attr(href)').extract_first()
        next_page_url = response.urljoin(next_page_url)
        if next_page_url:
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        yield {'Response': response}

Hi David, this is my first post here and I'm having trouble formatting the code. My question is simple: pagination isn't working in this spider, and I'm not sure what I'm missing. – oscarQ

Answer


I ran your code and found a few errors. In the first loop you never check the value of url, and sometimes it is None. That raises an exception that stops execution, which is why you think the pagination isn't working.
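To see why the exception hides the pagination (not part of the original answer, just a plain-Python sketch of the mechanism): `parse` is a generator, and an uncaught exception raised partway through it kills the generator before the pagination code at the bottom ever runs. Here the `AttributeError` on `None` stands in for the error Scrapy raises when `Request` gets a `None` url:

```python
def parse_like():
    # simulates the listing loop: one good url, then a missing one
    for url in ["http://a", None, "http://c"]:
        yield url.upper()   # raises AttributeError when url is None
    yield "next-page"       # simulates the pagination request at the end

results = []
try:
    for item in parse_like():
        results.append(item)
except AttributeError:
    pass  # the generator is dead; "next-page" was never yielded

print(results)  # only the items yielded before the crash
```

Since the pagination `yield` sits after the loop, a single bad listing is enough to prevent the next-page request entirely, even though the pagination code itself is correct.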

Here is a working version:

# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    #allowed_domains = ['www.yellowpages.com']
    start_urls = ['https://www.yellowpages.com/search?search_terms=handyman&geo_location_terms=Miami%2C+FL']

    def parse(self, response):
        for listing in response.css('div.search-results.organic div.srp-listing'):
            url = listing.css('a.track-visit-website::attr(href)').extract_first()
            if url:
                yield scrapy.Request(url=url, callback=self.parse_details)

        # follow pagination links
        next_page_url = response.css('a.next.ajax-page::attr(href)').extract_first()
        next_page_url = response.urljoin(next_page_url)
        if next_page_url:
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        yield {'Response': response}
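One side note, not part of the original answer: `urljoin` returns the base URL unchanged when its second argument is falsy, so the `if next_page_url:` check placed after the join never actually detects a missing next link (the duplicate-request filter is what stops the crawl on the last page). Checking the raw extracted value before joining makes the last-page guard explicit. A minimal stdlib sketch, with a made-up `base` URL:

```python
from urllib.parse import urljoin

base = "https://www.yellowpages.com/search?search_terms=handyman&page=2"

# urljoin falls back to the base when the second argument is falsy,
# so joining first and checking afterwards can't catch a missing link.
joined = urljoin(base, None)
print(joined == base)  # True

# Guarding on the raw value first makes the intent explicit.
next_href = None  # what extract_first() returns on the last results page
next_url = urljoin(base, next_href) if next_href else None
print(next_url)  # None: no next-page request is scheduled
```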

Thanks so much, you guys are awesome! – oscarQ


No problem. If this solved your issue, please don't hesitate to accept the answer. –