2017-08-25

I am trying to build a crawler to get housing data from the Craigslist website, but the Scrapy spider does not crawl the next page recursively: after fetching the first page it stops and never moves on to the next one.

Below is the code. It works for the first page, but for the love of God I cannot figure out why it does not go on to the next page. Any insight is much appreciated. I followed this part from the Scrapy tutorial.

import re

import scrapy
from scrapy.linkextractors import LinkExtractor


class QuotesSpider(scrapy.Spider):
    name = "craigslistmm"
    start_urls = [
        "https://vancouver.craigslist.ca/search/hhh"
    ]

    def parse_second(self, response):
        # Collect everything about one posting into a single dict,
        # merging in the listing-level fields passed along via meta.
        meta_dict = response.meta
        for q in response.css("section.page-container"):
            meta_dict["post_details"] = {
                "location": {
                    "longitude": q.css("div.mapAndAttrs div.mapbox div.viewposting::attr(data-longitude)").extract(),
                    "latitude": q.css("div.mapAndAttrs div.mapbox div.viewposting::attr(data-latitude)").extract(),
                },
                "detailed_info": " ".join(q.css("section#postingbody::text").extract()).strip(),
            }
        return meta_dict

    def parse(self, response):
        # Captures the area and ad-type segments of a posting URL, e.g. /van/apa/...
        pattern = re.compile(r"/([a-z]+)/([a-z]+)/.+")
        for q in response.css("li.result-row"):
            post_urls = q.css("p.result-info a::attr(href)").extract_first()
            mm = re.match(pattern, post_urls)

            # Request the posting's detail page, attaching the listing-level
            # fields via meta so parse_second can return them together.
            next_url = "https://vancouver.craigslist.ca/" + post_urls
            request = scrapy.Request(next_url, callback=self.parse_second)
            # Pagination attempts that did not work either:
            # next_page = response.xpath('.//a[@class="button next"]/@href').extract_first()
            # follow_url = "https://vancouver.craigslist.ca/" + next_page
            # request1 = scrapy.Request(follow_url, callback=self.parse)
            # yield response.follow(next_page, callback=self.parse)

            request.meta["id"] = q.css("li.result-row::attr(data-pid)").extract_first()
            request.meta["pricevaluation"] = q.css("p.result-info span.result-meta span.result-price::text").extract_first()
            request.meta["information"] = q.css("p.result-info span.result-meta span.housing::text").extract_first()
            request.meta["neighborhood"] = q.css("p.result-info span.result-meta span.result-hood::text").extract_first()
            request.meta["area"] = mm.group(1)
            request.meta["adtype"] = mm.group(2)

            yield request

        # Follow the next results page -- this is the part that never fires
        # beyond the first page.
        next_page = LinkExtractor(allow=r"s=\d+").extract_links(response)[0]
        yield response.follow(next_page.url, callback=self.parse)

Answer

The problem seems to be the next_page extraction using LinkExtractor. If you look at the crawl log, you will see that duplicate requests are being filtered out. There are more links on the page that satisfy your extraction rule, and they are not necessarily extracted in any particular order (or not in the order you would wish), so taking element [0] does not reliably give you the "next page" link.
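To see this for yourself, one could log every link the extractor matches before indexing into the list; a small debugging snippet (not from the original answer) meant to sit at the end of the question's parse method:

links = LinkExtractor(allow=r"s=\d+").extract_links(response)
for link in links:
    # Several URLs on the page match "s=\d+", and index 0 is not
    # guaranteed to be the "next page" link.
    self.logger.info("matched pagination link: %s", link.url)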

I think a better approach is to extract exactly the link you need. Try building next_page like this:

next_page = response.xpath('//span[@class="buttons"]//a[contains(., "next")]/@href').extract_first()
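Putting that into the spider, the pagination part of parse could look like the following; a minimal sketch, assuming the suggested XPath matches Craigslist's "next" button, with the per-posting requests kept exactly as in the question:

def parse(self, response):
    # ... yield the per-posting detail requests exactly as in the question ...

    # Take only the "next" button's href instead of the first LinkExtractor
    # match, which may be a duplicate or an out-of-order pagination link.
    next_page = response.xpath('//span[@class="buttons"]//a[contains(., "next")]/@href').extract_first()
    if next_page is not None:
        # response.follow resolves the relative href against the current page
        # URL, so no manual "https://vancouver.craigslist.ca/" prefix is needed.
        yield response.follow(next_page, callback=self.parse)

The None check stops the crawl cleanly on the last results page, where no next link exists.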
Comments:

Only one link will be fetched this way, and I have tried different approaches similar to the one you mention, but it didn't work. – Bg1850

But I will try again with your solution and update. – Bg1850

It works for me (at least until my IP gets blocked.. :-)) –