
I am developing a simple scraper to get 9gag posts and their images, but due to some technical difficulties I cannot stop the scraper and it keeps crawling, which I don't want. I want to increment a counter and stop after 100 posts. However, the 9gag page is designed so that each response only returns 10 posts, and after every iteration my counter value resets to 10, so my loop runs forever and never stops. How do I stop the Scrapy spider after a certain number of requests?


# -*- coding: utf-8 -*-
import scrapy
from _9gag.items import GagItem

class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = (
        'http://www.9gag.com/',
    )

    last_gag_id = None

    def parse(self, response):
        count = 0
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            count += 1
            if gag_id:
                if (count != 100):
                    last_gag_id = gag_id[0]
                    ninegag_item = GagItem()
                    ninegag_item['entry_id'] = gag_id[0]
                    ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
                    ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
                    ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
                    ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
                    ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()

                    yield ninegag_item
                else:
                    break

        next_url = 'http://9gag.com/?id=%s&c=200' % last_gag_id
        yield scrapy.Request(url=next_url, callback=self.parse)
        print count

The code for items.py is here:

from scrapy.item import Item, Field 


class GagItem(Item): 
    entry_id = Field() 
    url = Field() 
    votes = Field() 
    comments = Field() 
    title = Field() 
    img_url = Field() 

So, I wanted to increment a global count value and tried doing it by passing three arguments to the parse function, which gives the error

TypeError: parse() takes exactly 3 arguments (2 given) 

So, is there a way to pass a global count value, return it after each iteration, and stop after (say) 100 posts?

The whole project can be found here on GitHub. Even if I set POST_LIMIT = 100 the infinite loop happens; here is the command I executed:

scrapy crawl first -s POST_LIMIT=10 --output=output.json 
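
(For context: a value passed with -s only ends up in the Scrapy settings object; a custom name such as POST_LIMIT has no effect unless the spider reads it and acts on it. A minimal sketch, assuming that custom setting name:)

# Sketch (assumption): reading a custom -s setting inside the spider.
# POST_LIMIT is not a built-in Scrapy setting, so it only matters if the
# spider looks it up, e.g. via self.settings, and uses it in its own logic.
import scrapy

class LimitedSpider(scrapy.Spider):
    name = "limited"
    start_urls = ['http://www.9gag.com/']

    def parse(self, response):
        post_limit = self.settings.getint('POST_LIMIT', 100)
        self.logger.info("POST_LIMIT is %d", post_limit)
        # ... use post_limit to decide whether to yield further requests ...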

Answers

Answer 1 (score 4)

First: use self.count, initialized outside of parse. Then, rather than stopping the parsing of items, only generate new requests while the limit has not been reached. See the following code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()


class FirstSpider(scrapy.Spider):

    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/',)

    last_gag_id = None
    COUNT_MAX = 30  # stop requesting new pages once this many items have been scraped
    count = 0

    def parse(self, response):

        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            ninegag_item = GagItem()
            ninegag_item['entry_id'] = gag_id[0]
            ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
            ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
            ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
            ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
            ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
            self.last_gag_id = gag_id[0]
            self.count = self.count + 1
            yield ninegag_item

        # only follow the next page while the limit has not been reached
        if (self.count < self.COUNT_MAX):
            next_url = 'http://9gag.com/?id=%s&c=10' % self.last_gag_id
            yield scrapy.Request(url=next_url, callback=self.parse)
Comment: Is there a way to find out when the scraping has finished?

Comment: Works very well, thanks @Frank
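
Not from the original thread, but for reference: Scrapy calls a spider's closed() method (a shortcut for the spider_closed signal) when the crawl ends, which is one way to find out that scraping has finished. A minimal sketch:

# Sketch: detecting the end of the crawl via the closed() hook.
import scrapy

class FirstSpider(scrapy.Spider):
    name = "first"
    start_urls = ['http://www.9gag.com/']

    def parse(self, response):
        pass  # ... usual parsing ...

    def closed(self, reason):
        # called once when the spider finishes; reason is e.g. 'finished'
        self.logger.info("Spider closed: %s", reason)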

Answer 2 (score 0)

count is local to the parse() method, so it is not preserved between pages. Change all occurrences of count to self.count to make it an instance variable of the class, and it will persist between pages.

Answer 3 (score 0)

Spider arguments are passed using the -a option (see the Scrapy documentation on spider arguments).
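
A minimal sketch (not part of the answer) of how -a arguments reach the spider: they are passed to the spider's constructor as keyword arguments, so a limit such as the hypothetical post_limit below can be stored and compared against a counter:

# Sketch (assumption): receiving a spider argument passed with
#   scrapy crawl first -a post_limit=100
import scrapy

class FirstSpider(scrapy.Spider):
    name = "first"
    start_urls = ['http://www.9gag.com/']

    def __init__(self, post_limit=100, *args, **kwargs):
        super(FirstSpider, self).__init__(*args, **kwargs)
        self.post_limit = int(post_limit)  # -a values arrive as strings
        self.count = 0

    def parse(self, response):
        # ... yield items, increment self.count, and only request the next
        # page while self.count < self.post_limit ...
        pass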

Answer 4 (score 2)

There is a built-in setting, CLOSESPIDER_PAGECOUNT, which can be passed via the command-line -s argument or changed in settings: scrapy crawl <spider> -s CLOSESPIDER_PAGECOUNT=100

One small caveat: if you have enabled caching, it will count cache hits as page counts as well.
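
The same limit can also be set per spider instead of on the command line. A minimal sketch, also using the related CLOSESPIDER_ITEMCOUNT setting, which stops the spider after a number of scraped items rather than pages and maps more directly onto "stop after 100 posts" (the shutdown is graceful, so a few extra responses or items may still come through):

# Sketch: limiting the crawl with the built-in CloseSpider extension settings.
import scrapy

class FirstSpider(scrapy.Spider):
    name = "first"
    start_urls = ['http://www.9gag.com/']

    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 100,  # close after ~100 responses
        'CLOSESPIDER_ITEMCOUNT': 100,  # or after ~100 scraped items
    }

    def parse(self, response):
        pass  # ... parse as usual; the extension closes the spider ...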