Multi-page Scrapy crawl finishes too early - callbacks are not chained and waited on

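I am building a football app and trying to get my head around how multi-page scraping works.

For example, the first page (http://footballdatabase.com/ranking/world/1) has two sets of links I want to scrape: the club name links and the pagination links.

I want to a) go through every page (pagination), then b) go through each club and grab its current EU ranking.

The code I wrote partly works. However, I end up with only about 45 results instead of 2000+ clubs. Note: there are 45 pages of pagination. So it seems that once it has gone through those, everything finishes and my items are yielded.

How can I chain everything together so that I end up with something more like 2000+ results?

Here is my code: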

# get pagination links
def parse(self, response):
    for href in response.css("ul.pagination > li > a::attr('href')"):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_club)

# get club links on each of the pagination pages
def parse_club(self, response):
    # loop through each of the rows
    for sel in response.xpath('//table/tbody/tr'):
        item = rankingItem()
        item['name'] = sel.xpath('td/a/div[@class="limittext"]/text()').extract()

        # get more club information
        club_href = sel.xpath('td[2]/a[1]/@href').extract_first()
        club_url = response.urljoin(club_href)
        request = scrapy.Request(club_url, callback=self.parse_club_page_2)

        request.meta['item'] = item
        return request

# get the EU ranking on each of the club pages
def parse_club_page_2(self, response):
    item = response.meta['item']
    item['eu_ranking'] = response.xpath('//a[@class="label label-default"][2]/text()').extract()

    yield item

Source

2016-02-26 willdanceforfun

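Your parse_club callback needs to yield the request, not return it. A return leaves the function on the first table row, so each pagination page produces only one club request, which is why you get about 45 results: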

# get club links on each of the pagination pages
def parse_club(self, response):
    # loop through each of the rows
    for sel in response.xpath('//table/tbody/tr'):
        item = rankingItem()
        item['name'] = sel.xpath('td/a/div[@class="limittext"]/text()').extract()

        # get more club information
        club_href = sel.xpath('td[2]/a[1]/@href').extract_first()
        club_url = response.urljoin(club_href)
        request = scrapy.Request(club_url, callback=self.parse_club_page_2)

        request.meta['item'] = item
        yield request  # FIX HERE: yield instead of return
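
The difference can be reproduced without Scrapy at all. The following is a minimal sketch in plain Python: `rows`, `with_return` and `with_yield` are made-up names for illustration, and the strings only stand in for `scrapy.Request` objects.

```python
# Hypothetical stand-ins: each string plays the role of a scrapy.Request.
rows = ["club-a", "club-b", "club-c"]


def with_return(rows):
    # Mirrors the original parse_club: leaves the function on the first row.
    for row in rows:
        return "request for " + row


def with_yield(rows):
    # Mirrors the fixed parse_club: a generator producing one value per row.
    for row in rows:
        yield "request for " + row


single = with_return(rows)
many = list(with_yield(rows))

print(single)     # only the first row is ever seen
print(len(many))  # one entry per row
```

With `yield`, the function becomes a generator, so the caller can keep pulling values until the loop has visited every row.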

I would also simplify the part that locates the elements:

def parse_club(self, response):
    # loop through each of the club cells
    for sel in response.css('td.club'):
        item = rankingItem()
        item['name'] = sel.xpath('.//div[@itemprop="itemListElement"]/text()').extract_first()

        # get more club information
        club_href = sel.xpath('.//a/@href').extract_first()
        club_url = response.urljoin(club_href)
        request = scrapy.Request(club_url, callback=self.parse_club_page_2)

        request.meta['item'] = item
        yield request
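
To see how the half-filled item travels from one callback to the next through `meta`, here is a toy simulation in plain Python. It is only a sketch under loud assumptions: `FakeRequest` and `crawl` are invented for this illustration and are not Scrapy's classes or engine, the request object doubles as the response, and the URLs and ranking strings are placeholders.

```python
from collections import deque


class FakeRequest:
    """Toy stand-in for scrapy.Request; it doubles as the response here."""

    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


def parse_club(response):
    # One follow-up request per club row; the partial item rides in meta.
    for n in range(1, 4):
        item = {"name": f"club-{n}"}
        yield FakeRequest(f"{response.url}/club-{n}", parse_club_page_2,
                          meta={"item": item})


def parse_club_page_2(response):
    # Pick the partial item back up and complete it.
    item = response.meta["item"]
    item["eu_ranking"] = f"ranking from {response.url}"
    yield item


def crawl(start_requests):
    # Hypothetical driver loop: keep going until no requests are left.
    queue = deque(start_requests)
    items = []
    while queue:
        request = queue.popleft()
        for result in request.callback(request):
            if isinstance(result, FakeRequest):
                queue.append(result)  # schedule the follow-up callback
            else:
                items.append(result)  # a finished item
    return items


pages = [FakeRequest(f"/ranking/world/{p}", parse_club) for p in (1, 2)]
items = crawl(pages)
print(len(items))  # 2 pages x 3 clubs = 6
```

Each listing page fans out into one request per club, and every club request ends in exactly one finished item, which is the shape the real spider needs to reach the 2000+ results.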

Source

2016-02-26 15:26:43 alecxe
