
I am developing a simple scraper to get 9gag posts and their images, but due to some technical difficulties I cannot stop the scraper and it keeps crawling, which I don't want. I want to increment a counter and stop after 100 posts. However, the 9gag page is designed so that each response only returns 10 posts, and after every iteration my counter value resets to 10, so my loop runs forever and never stops. How do I stop the Scrapy spider after a certain number of requests?


# -*- coding: utf-8 -*-
import scrapy
from _9gag.items import GagItem

class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = (
        'http://www.9gag.com/',
    )

    last_gag_id = None

    def parse(self, response):
        count = 0
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            count += 1
            if gag_id:
                if (count != 100):
                    last_gag_id = gag_id[0]
                    ninegag_item = GagItem()
                    ninegag_item['entry_id'] = gag_id[0]
                    ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
                    ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
                    ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
                    ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
                    ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()

                    yield ninegag_item
                else:
                    break

        next_url = 'http://9gag.com/?id=%s&c=200' % last_gag_id
        yield scrapy.Request(url=next_url, callback=self.parse)
        print count

The code for items.py is here:

from scrapy.item import Item, Field 


class GagItem(Item): 
    entry_id = Field() 
    url = Field() 
    votes = Field() 
    comments = Field() 
    title = Field() 
    img_url = Field() 

So, I wanted to increment a global count value and tried doing it by passing three arguments to the parse function, which gives the error

TypeError: parse() takes exactly 3 arguments (2 given) 

So, is there a way to pass a global count value, return it after each iteration, and stop after (say) 100 posts?

The whole project can be found here on GitHub. Even if I set POST_LIMIT = 100 the infinite loop happens; here is the command I executed:

scrapy crawl first -s POST_LIMIT=10 --output=output.json 
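
(For context: a value passed with -s only ends up in the Scrapy settings object; a custom name such as POST_LIMIT has no effect unless the spider reads it and acts on it. A minimal sketch, assuming that custom setting name:)

# Sketch (assumption): reading a custom -s setting inside the spider.
# POST_LIMIT is not a built-in Scrapy setting, so it only matters if the
# spider looks it up, e.g. via self.settings, and uses it in its own logic.
import scrapy

class LimitedSpider(scrapy.Spider):
    name = "limited"
    start_urls = ['http://www.9gag.com/']

    def parse(self, response):
        post_limit = self.settings.getint('POST_LIMIT', 100)
        self.logger.info("POST_LIMIT is %d", post_limit)
        # ... use post_limit to decide whether to yield further requests ...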

Answers

Answer 1 (score 4)

First: use self.count, initialized outside of parse. Then, rather than stopping the parsing of items, only generate new requests while the limit has not been reached. See the following code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()


class FirstSpider(scrapy.Spider):

    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/',)

    last_gag_id = None
    COUNT_MAX = 30  # stop requesting new pages once this many items have been scraped
    count = 0

    def parse(self, response):

        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            ninegag_item = GagItem()
            ninegag_item['entry_id'] = gag_id[0]
            ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
            ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
            ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
            ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
            ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
            self.last_gag_id = gag_id[0]
            self.count = self.count + 1
            yield ninegag_item

        # only follow the next page while the limit has not been reached
        if (self.count < self.COUNT_MAX):
            next_url = 'http://9gag.com/?id=%s&c=10' % self.last_gag_id
            yield scrapy.Request(url=next_url, callback=self.parse)
Comment: Is there a way to find out when the scraping has finished?

Comment: Works very well, thanks @Frank
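
Not from the original thread, but for reference: Scrapy calls a spider's closed() method (a shortcut for the spider_closed signal) when the crawl ends, which is one way to find out that scraping has finished. A minimal sketch:

# Sketch: detecting the end of the crawl via the closed() hook.
import scrapy

class FirstSpider(scrapy.Spider):
    name = "first"
    start_urls = ['http://www.9gag.com/']

    def parse(self, response):
        pass  # ... usual parsing ...

    def closed(self, reason):
        # called once when the spider finishes; reason is e.g. 'finished'
        self.logger.info("Spider closed: %s", reason)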

Answer 2 (score 0)

count is local to the parse() method, so it is not preserved between pages. Change all occurrences of count to self.count to make it an instance variable of the class, and it will persist between pages.

Answer 3 (score 0)

Spider arguments are passed using the -a option (see the Scrapy documentation on spider arguments).
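
A minimal sketch (not part of the answer) of how -a arguments reach the spider: they are passed to the spider's constructor as keyword arguments, so a limit such as the hypothetical post_limit below can be stored and compared against a counter:

# Sketch (assumption): receiving a spider argument passed with
#   scrapy crawl first -a post_limit=100
import scrapy

class FirstSpider(scrapy.Spider):
    name = "first"
    start_urls = ['http://www.9gag.com/']

    def __init__(self, post_limit=100, *args, **kwargs):
        super(FirstSpider, self).__init__(*args, **kwargs)
        self.post_limit = int(post_limit)  # -a values arrive as strings
        self.count = 0

    def parse(self, response):
        # ... yield items, increment self.count, and only request the next
        # page while self.count < self.post_limit ...
        pass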

Answer 4 (score 2)

There is a built-in setting, CLOSESPIDER_PAGECOUNT, which can be passed via the command-line -s argument or changed in settings: scrapy crawl <spider> -s CLOSESPIDER_PAGECOUNT=100

One small caveat: if you have enabled caching, it will count cache hits as page counts as well.
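
The same limit can also be set per spider instead of on the command line. A minimal sketch, also using the related CLOSESPIDER_ITEMCOUNT setting, which stops the spider after a number of scraped items rather than pages and maps more directly onto "stop after 100 posts" (the shutdown is graceful, so a few extra responses or items may still come through):

# Sketch: limiting the crawl with the built-in CloseSpider extension settings.
import scrapy

class FirstSpider(scrapy.Spider):
    name = "first"
    start_urls = ['http://www.9gag.com/']

    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 100,  # close after ~100 responses
        'CLOSESPIDER_ITEMCOUNT': 100,  # or after ~100 scraped items
    }

    def parse(self, response):
        pass  # ... parse as usual; the extension closes the spider ...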