Python Scrapy: collecting data from several pages into one item (dictionary)

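I have a site to scrape. The main page has a list of stories, so that page is the start page for parsing. My spider collects data about each story from it: author, rating, publication date, and so on. The spider does all of this correctly.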

import scrapy
from scrapy.spiders import Spider
from sxtl.items import SxtlItem
from scrapy.http.request import Request


class SxtlSpider(Spider):
    name = "sxtl"

    start_urls = ['some_site']

    def parse(self, response):
        list_of_stories = response.xpath('//div[@id and @class="storyBox"]')

        item = SxtlItem()

        for i in list_of_stories:
            pre_rating = i.xpath(
                'div[@class="storyDetail"]/div[@class="storyDetailWrapper"]'
                '/div[@class="block rating_positive"]/span/text()').extract()
            rating = float(("".join(pre_rating)).replace("+", ""))

            link = "".join(i.xpath(
                'div[@class="wrapSLT"]/div[@class="titleStory"]/a/@href'
            ).extract())

            if rating > 6:
                yield Request("".join(link), meta={'item': item},
                              callback=self.parse_story)
            else:
                break

    def parse_story(self, response):
        item = response.meta['item']

        number_of_pages = response.xpath(
            '//div[@class="pNavig"]/a[@href][last()-1]/text()').extract()

        if number_of_pages:
            item['number_of_pages'] = int("".join(number_of_pages))
        else:
            item['number_of_pages'] = 1

        item['date'] = "".join(response.xpath(
            '//span[@class="date"]/text()').extract()).strip()
        item['author'] = "".join(response.xpath(
            '//a[@class="author"]/text()').extract()).strip()
        item['text'] = response.xpath(
            '//div[@id="storyText"]/div[@itemprop="description"]/text()'
            ' | //div[@id="storyText"]/div[@itemprop="description"]/p/text()'
        ).extract()
        item['list_of_links'] = response.xpath(
            '//div[@class="pNavig"]/a[@href]/@href').extract()

        yield item

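So the data is collected correctly, but we only get the first page of each story. However, each story has several pages (with links to pages 2, 3, 4, sometimes 15). That is where the problem arises. I replaced "yield item" with this (to get the second page of each story):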

yield Request("".join(item['list_of_links'][0]), meta={'item': item},
              callback=self.get_text)


def get_text(self, response):
    item = response.meta['item']

    item['text'].extend(response.xpath(
        '//div[@id="storyText"]/div[@itemprop="description"]/text()'
        ' | //div[@id="storyText"]/div[@itemprop="description"]/p/text()'
    ).extract())

    yield item

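The spider does collect the next (second) pages, but it attaches them to the first page of an arbitrary story. For example, the second page of the first story may be added to the fourth story, the second page of the fifth story to the first story, and so on.

Please help: how do I collect data into one item (one dictionary) when the data to be scraped is spread across several web pages? (In this case, how do I keep data from different items from mixing with each other?)

Thanks.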

Source

2017-04-16 User New

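Have you checked this link: http://stackoverflow.com/questions/13910357/how-can-i-use-multiple-requests-and-pass-items-in-between-them-in-scrapy-python ? – Wandrille

@Wandrille I have already found a solution, but thanks for the interesting link. –

After many attempts and a lot of documentation reading, I found the solution: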

item = SxtlItem()

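This item declaration should be moved from the parse function to the beginning of the parse_story function. The line "item = response.meta['item']" in parse_story should be removed. And, of course,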

yield Request("".join(link), meta={'item':item}, callback=self.parse_story)

in "parse"

should be changed to

yield Request("".join(link), callback=self.parse_story)

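Why? Because the Item was declared only once, and all of its fields were constantly being overwritten. While each story had only one page, everything looked fine, as if we had a "new" item each time. But when a story has several pages, the item gets overwritten in an unpredictable order and we receive jumbled results. In short: a new Item should be created as many times as the number of Item objects we want to save.

After moving "item = SxtlItem()" to the right place, everything works fine.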
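The effect described above can be reproduced without Scrapy. What follows is only a minimal sketch, not the spider itself: the `crawl` function, the `pages` data, and the two loops are hypothetical stand-ins for `parse`, `parse_story`, and `get_text`, and it assumes that all first-page callbacks run before any second-page callback.

```python
def crawl(share_item):
    """Simulate two stories with two pages each (made-up data)."""
    pages = {'A': ['A1', 'A2'], 'B': ['B1', 'B2']}
    shared = {}
    # Like parse(): one "request" per story, each carrying an item.
    # With share_item=True every request carries the very same dict.
    requests = [(story, shared if share_item else {}) for story in pages]

    # Like parse_story(): store the first page of every story.
    second_stage = []
    for story, item in requests:
        item['text'] = [pages[story][0]]
        second_stage.append((story, item))

    # Like get_text(): append the second page and "yield" a snapshot.
    results = []
    for story, item in second_stage:
        item['text'].append(pages[story][1])
        results.append(list(item['text']))
    return results


print(crawl(share_item=True))   # [['B1', 'A2'], ['B1', 'A2', 'B2']]
print(crawl(share_item=False))  # [['A1', 'A2'], ['B1', 'B2']]
```

With a shared item, page 2 of story A lands behind page 1 of story B, which matches the mixing seen in the question. With one item per story, each story keeps its own pages.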

Source

2017-04-17 14:14:48

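Non-technically speaking:

1) Scrape the first page of the story.
2) Check whether it has more pages or not.
3) If not, just yield the item.
4) If it has a next-page button/link, scrape that link and pass the whole data dictionary on to the next callback method.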

def parse_story(self, response):
    item = response.meta['item']

    number_of_pages = response.xpath(
        '//div[@class="pNavig"]/a[@href][last()-1]/text()').extract()

    if number_of_pages:
        item['number_of_pages'] = int("".join(number_of_pages))
    else:
        item['number_of_pages'] = 1

    item['date'] = "".join(response.xpath(
        '//span[@class="date"]/text()').extract()).strip()
    item['author'] = "".join(response.xpath(
        '//a[@class="author"]/text()').extract()).strip()
    item['text'] = response.xpath(
        '//div[@id="storyText"]/div[@itemprop="description"]/text()'
        ' | //div[@id="storyText"]/div[@itemprop="description"]/p/text()'
    ).extract()
    item['list_of_links'] = response.xpath(
        '//div[@class="pNavig"]/a[@href]/@href').extract()

    # Placeholder: set nextPageURL to the URL of the next page, or leave it
    # as None on the last page. The extraction is site-specific and is not
    # shown in this answer.
    nextPageURL = None

    # if it has a NEXT PAGE button
    if nextPageURL:
        yield Request(url=nextPageURL, callback=self.get_text,
                      meta={'item': item})
    else:
        # it has no more pages, so just yield the data
        yield item


def get_text(self, response):
    item = response.meta['item']

    # merge text here
    item['text'] = item['text'] + response.xpath(
        '//div[@id="storyText"]/div[@itemprop="description"]/text()'
        ' | //div[@id="storyText"]/div[@itemprop="description"]/p/text()'
    ).extract()

    # Placeholder again: the next-page URL found on this response, or None.
    nextPageURL = None

    # check again for a NEXT PAGE button; if there is one, call this same
    # callback again
    if nextPageURL:
        yield Request(url=nextPageURL, callback=self.get_text,
                      meta={'item': item})
    else:
        # no more pages, now finally yield the ITEM
        yield item
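One way to decide between "request the next page" and "yield the item" is to carry the list of pages still to be fetched along with the item. The sketch below shows only that decision logic in plain Python. The `merge_page` helper, the `site` dictionary, and the URLs are hypothetical, and it assumes that the pagination links list each remaining page exactly once and in order, which the thread does not confirm.

```python
def merge_page(item, pending_links, page_text):
    """Merge one page's text into item and decide the next step."""
    item['text'] = item['text'] + page_text
    if pending_links:
        # More pages remain: follow the first one, carry the rest along.
        return ('request', pending_links[0], pending_links[1:])
    # Nothing left to fetch: the item is complete.
    return ('item', item, [])


# Simulated site: URL -> text fragments on that page (made-up data).
site = {'/s1?p=2': ['two'], '/s1?p=3': ['three']}

item = {'text': []}  # a fresh item for this one story
step = merge_page(item, ['/s1?p=2', '/s1?p=3'], ['one'])
while step[0] == 'request':
    _, url, remaining = step
    step = merge_page(item, remaining, site[url])

print(step[1]['text'])  # ['one', 'two', 'three']
```

In a Scrapy callback, the 'request' branch would roughly correspond to yielding a Request whose meta carries both the item and the remaining links (for example under a hypothetical 'pending' key), and the 'item' branch to "yield item".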

Source

2017-04-16 16:49:42 Umair

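Thanks, but that is exactly what I was trying to do: "1) Scrape the first page of the story 2) Check whether it has more pages or not 3) If not, just yield the item 4) If it has a next-page button/link, scrape that link and pass the whole data dictionary on to the next callback method." After that failed, I tried scraping just 1 or 2 pages from the site. Anyway, I have solved the problem and will show it in an answer. –

But in any case, I appreciate your time and attention. Thank you. –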
