I am using Scrapy 1.0.3 and cannot figure out how CLOSESPIDER works. For the command scrapy crawl domain_links --set CLOSESPIDER_PAGECOUNT=1 there is correctly one request, but for a page count of two, scrapy crawl domain_links --set CLOSESPIDER_PAGECOUNT=2, the requests never stop. The Scrapy CLOSESPIDER_PAGECOUNT setting does not work the way I expect it to.
So please explain how it works with a simple example.
Here is my spider code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
# PathsSpiderItem is this project's scrapy.Item subclass; its import was omitted here.

class DomainLinksSpider(CrawlSpider):
    name = "domain_links"
    #allowed_domains = ["www.example.org"]
    start_urls = ["http://www.example.org/"]  # start URLs need an http:// scheme

    rules = (
        # Extract in-domain links and parse them with the spider's method parse_page
        Rule(LinkExtractor(allow_domains="www.example.org"), callback='parse_page'),
    )

    def parse_page(self, response):
        print '<<<', response.url  # Python 2 print; Scrapy 1.0.3 runs on Python 2
        items = []
        for link in LinkExtractor(allow_domains="www.example.org", unique=True).extract_links(response):
            item = PathsSpiderItem()
            item['url'] = link.url
            items.append(item)
        return items
It does not even work for this simple spider:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['karen.pl']
    start_urls = ['http://www.karen.pl']

    rules = (
        # Follow every in-domain link and parse it with the spider's method parse_item
        Rule(LinkExtractor(allow_domains="www.karen.pl"), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        return item
It is not infinite for this one, but the request counts still overshoot the limit:
scrapy crawl example --set CLOSESPIDER_PAGECOUNT=1    'downloader/request_count': 1,
scrapy crawl example --set CLOSESPIDER_PAGECOUNT=2    'downloader/request_count': 17,
scrapy crawl example --set CLOSESPIDER_PAGECOUNT=3    'downloader/request_count': 19,
Maybe this is because of parallel downloading. Yes: with CONCURRENT_REQUESTS = 1 the CLOSESPIDER_PAGECOUNT setting works for the second example, and I checked the first one as well; it works too. It only looked infinite in my project because the site's sitemap has a lot of URLs being crawled :) Presumably the CloseSpider extension shuts the spider down gracefully, so requests already scheduled under the default CONCURRENT_REQUESTS = 16 still complete, which matches the 17 requests seen for a page count of 2.
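A minimal sketch of the combination described above, assuming the spider name example from the second snippet; limiting concurrency means no extra requests are in flight when the page counter reaches the limit:

    scrapy crawl example --set CLOSESPIDER_PAGECOUNT=2 --set CONCURRENT_REQUESTS=1

The same pair of settings can also be pinned on the spider itself via the custom_settings attribute supported in Scrapy 1.0:

    class ExampleSpider(CrawlSpider):
        name = 'example'
        # start_urls and rules as in the snippet above
        custom_settings = {
            'CLOSESPIDER_PAGECOUNT': 2,
            'CONCURRENT_REQUESTS': 1,
        }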
Are you sure about that return? Shouldn't you yield the items one by one? I would rather use BaseSpider, but it looks like parse_page is being called endlessly without really yielding any items? – Turo
I think it is fine. It was inspired by https://github.com/scrapy/dirbot/blob/master/dirbot/spiders/dmoz.py though of course that example is a bit dated. – Thomas
Turo, thanks for the suggestion. It is a nice way to optimize memory. – Thomas
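For reference, a minimal sketch of Turo's suggestion, rewriting the first spider's parse_page as a generator; it reuses the PathsSpiderItem and domain from the question and is a sketch, not tested against the original project:

    def parse_page(self, response):
        for link in LinkExtractor(allow_domains="www.example.org", unique=True).extract_links(response):
            item = PathsSpiderItem()
            item['url'] = link.url
            # yield hands each item to the item pipeline immediately,
            # so the full list never has to sit in memory at once
            yield item

This is the memory optimization Thomas acknowledges in the last comment.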