Scrapy CLOSESPIDER_PAGECOUNT setting doesn't work as it should

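I use Scrapy 1.0.3 and can't figure out how the CloseSpider extension works. With a page count of 1, the command

scrapy crawl domain_links --set=CLOSESPIDER_PAGECOUNT=1

correctly makes one request, but with a page count of 2, the command

scrapy crawl domain_links --set CLOSESPIDER_PAGECOUNT=2

makes an endless number of requests.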

So please explain how it works, using a simple example.

This is my spider code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
# PathsSpiderItem is the project's own Item class with a 'url' field (definition not shown).

class DomainLinksSpider(CrawlSpider):
    name = "domain_links"
    #allowed_domains = ["www.example.org"]
    start_urls = ["http://www.example.org/"]

    rules = (
        # Extract links on www.example.org and parse each linked page with parse_page
        Rule(LinkExtractor(allow_domains="www.example.org"), callback='parse_page'),
    )

    def parse_page(self, response):
        self.logger.info('<<< %s', response.url)
        items = []

        # Build one item per unique same-domain link found on the page
        for link in LinkExtractor(allow_domains="www.example.org", unique=True).extract_links(response):
            item = PathsSpiderItem()
            item['url'] = link.url
            items.append(item)
        return items

It doesn't work even for this simple spider:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['karen.pl']
    start_urls = ['http://www.karen.pl']

    rules = (
        # Extract links on www.karen.pl and parse each linked page with parse_item
        Rule(LinkExtractor(allow_domains="www.karen.pl"), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()

        return item

but here it is not infinite:

scrapy crawl example --set CLOSESPIDER_PAGECOUNT=1
'downloader/request_count': 1,

scrapy crawl example --set CLOSESPIDER_PAGECOUNT=2
'downloader/request_count': 17,

scrapy crawl example --set CLOSESPIDER_PAGECOUNT=3
'downloader/request_count': 19,

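Maybe this is because of parallel downloading. Yes: with CONCURRENT_REQUESTS = 1, the CLOSESPIDER_PAGECOUNT setting works for the second example. I checked the first one and it works there too. It only looked almost infinite to me because the site map has a lot of URLs (my items) that get crawled :)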
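For reference, the combination described above can be written as a settings fragment. This is only a sketch: the values are the ones tried in the question, and placing them in settings.py (rather than on the command line) is an assumption about the project layout.

```python
# settings.py fragment (the same keys can also go in a spider's custom_settings dict)
CLOSESPIDER_PAGECOUNT = 2  # ask the CloseSpider extension to close the spider after 2 responses
CONCURRENT_REQUESTS = 1    # download one request at a time, so almost nothing is in flight at the limit
```

The equivalent one-off command would be: scrapy crawl example --set CLOSESPIDER_PAGECOUNT=2 --set CONCURRENT_REQUESTS=1. The trade-off is that the crawl becomes fully sequential and therefore slower.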

Source

2015-12-30 Thomas

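Are you sure about the return? Why not "yield" the items one by one? I'm more used to BaseSpider, but it looks like parse_page is called an infinite number of times without ever really yielding any items? – Turo

I think it's OK. It was inspired by https://github.com/scrapy/dirbot/blob/master/dirbot/spiders/dmoz.py, but of course this example is a bit updated. – Thomas

Turo, thanks for the suggestion. It is a good way to optimize memory. – Thomas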
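The return-versus-yield point from these comments can be illustrated without Scrapy. In this minimal sketch the function names are hypothetical and plain dicts stand in for items; a Scrapy callback may either return an iterable of items or be a generator that yields them.

```python
def links_as_list(urls):
    # Builds the whole list in memory before handing anything back.
    items = []
    for url in urls:
        items.append({'url': url})
    return items


def links_as_generator(urls):
    # Hands back one item at a time; no intermediate list is kept.
    for url in urls:
        yield {'url': url}
```

Both produce the same items. The generator form only avoids holding the full list in memory; it does not change how many requests are made.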

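CLOSESPIDER_PAGECOUNT is controlled by the CloseSpider extension, which counts every response until the limit is reached. At that point it tells the crawler process to start shutting down (finishing pending requests and closing the available slots).

The reason your spider ends right away when you specify CLOSESPIDER_PAGECOUNT=1 is that at that moment (when it gets its first response) there are no pending requests. They are created only after the first response, so the crawler process is already ready to end and does not take the following requests into account (because they are born after the first one).

When you specify CLOSESPIDER_PAGECOUNT>1, your spider is caught in the middle of creating requests and filling the request queue. By the time the spider knows it should finish, there are still pending requests to process, and they are executed as part of closing the spider.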
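The behaviour described above can be reproduced with a small toy model. This is not Scrapy's actual engine: the function name simulate, the fixed 20 links per page, and the rule that only requests already handed to the downloader are finished after the limit is hit are all simplifying assumptions made for illustration. The default concurrency of 16 mirrors Scrapy's default CONCURRENT_REQUESTS.

```python
from collections import deque


def simulate(pagecount, concurrency=16, links_per_page=20):
    """Toy model: return how many requests are sent before the crawl stops."""
    queue = deque(["start"])  # requests waiting to be downloaded
    in_flight = 0             # requests currently being downloaded
    sent = 0                  # total requests handed to the downloader
    responses = 0             # responses counted so far
    closing = False

    while True:
        # While the spider is open, fill free downloader slots from the queue.
        while not closing and queue and in_flight < concurrency:
            queue.popleft()
            in_flight += 1
            sent += 1
        if in_flight == 0:
            break  # nothing is downloading and nothing more will be sent
        # One in-flight request finishes and its response is counted.
        in_flight -= 1
        responses += 1
        if responses >= pagecount:
            closing = True  # limit reached: stop sending, let in-flight requests finish
        if not closing:
            # The callback extracts links, which become new queued requests.
            queue.extend(["link"] * links_per_page)
    return sent
```

Under these assumptions, simulate(1) gives 1 and simulate(2) gives 17, the same counts reported in the question, while simulate(2, concurrency=1) gives 2. Real crawls depend on timing, so other counts need not match exactly.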

Source

2015-12-30 18:50:56 eLRuLL

This helped me understand closespider_pagecount this week, thanks – tristanbailey
