
Scrapy `ReactorNotRestartable`: running two (or more) spiders from one class

I am using a two-stage crawl with Scrapy to aggregate data on a daily basis. The first stage generates a list of URLs from an index page, and the second stage writes the HTML of each URL in that list to a Kafka topic.

[image: Kafka cluster for Scrapy crawler]
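
As an illustration of the second stage, a minimal Scrapy item pipeline that pushes page HTML to Kafka might look like the sketch below. This is only a sketch under assumptions: the kafka-python producer, the broker address, the somesite_html topic name, and the html item field are placeholders rather than details from the question.

from kafka import KafkaProducer

class KafkaHtmlPipeline(object):
    """Hypothetical item pipeline: publish each crawled page's HTML to a Kafka topic."""

    def __init__(self):
        # Assumed broker address; in practice this would point at the Kafka cluster.
        self.producer = KafkaProducer(bootstrap_servers='localhost:9092')

    def process_item(self, item, spider):
        # 'html' is an assumed item field holding the raw response body.
        self.producer.send('somesite_html', item['html'].encode('utf-8'))
        return item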

Although the two components of the crawl are related, I would like them to be independent: the url_generator would run as a scheduled task once a day, while the page_requester would run continuously, processing URLs as they become available. To be "polite", I will adjust DOWNLOAD_DELAY so that the crawler finishes within 24 hours while keeping the load imposed on the site minimal.
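
For the politeness tuning, a back-of-the-envelope settings sketch could look like this; the figure of 50,000 URLs per day is invented purely for illustration, while DOWNLOAD_DELAY, CONCURRENT_REQUESTS_PER_DOMAIN and RANDOMIZE_DOWNLOAD_DELAY are standard Scrapy settings:

# settings.py (sketch): spread roughly 50,000 requests over 24 hours
# 86,400 s / 50,000 requests ≈ 1.7 s between requests
DOWNLOAD_DELAY = 1.7
CONCURRENT_REQUESTS_PER_DOMAIN = 1
RANDOMIZE_DOWNLOAD_DELAY = True  # waits 0.5x to 1.5x DOWNLOAD_DELAY between requests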

I created a CrawlerRunner class with functions to generate the URLs and to retrieve the HTML:

from twisted.internet import reactor 
from scrapy.crawler import Crawler 
from scrapy import log, signals 
from scrapy_somesite.spiders.create_urls_spider import CreateSomeSiteUrlList 
from scrapy_somesite.spiders.crawl_urls_spider import SomeSiteRetrievePages 
from scrapy.utils.project import get_project_settings 
import os 
import sys 

class CrawlerRunner:

    def __init__(self):
        sys.path.append(os.path.join(os.path.curdir, "crawl/somesite"))
        os.environ['SCRAPY_SETTINGS_MODULE'] = 'scrapy_somesite.settings'
        self.settings = get_project_settings()
        log.start()

    def create_urls(self):
        spider = CreateSomeSiteUrlList()
        crawler_create_urls = Crawler(self.settings)
        crawler_create_urls.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler_create_urls.configure()
        crawler_create_urls.crawl(spider)
        crawler_create_urls.start()
        reactor.run()

    def crawl_urls(self):
        spider = SomeSiteRetrievePages()
        crawler_crawl_urls = Crawler(self.settings)
        crawler_crawl_urls.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler_crawl_urls.configure()
        crawler_crawl_urls.crawl(spider)
        crawler_crawl_urls.start()
        reactor.run()

When I instantiate the class, I can successfully run either function on its own, but unfortunately I am unable to run them together:

from crawl.somesite import crawler_runner 

cr = crawler_runner.CrawlerRunner() 

cr.create_urls() 
cr.crawl_urls() 

The second function call raises twisted.internet.error.ReactorNotRestartable when it attempts to execute reactor.run() inside the crawl_urls function.
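
The root cause is that a Twisted reactor can be started at most once per process; once it has been stopped it cannot be run again. A minimal standalone illustration of the same error, independent of Scrapy:

from twisted.internet import reactor

reactor.callWhenRunning(reactor.stop)  # stop the reactor as soon as it starts
reactor.run()  # first run: starts, then stops cleanly
reactor.run()  # second run: raises twisted.internet.error.ReactorNotRestartable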

I would like to know whether there is a simple fix for this code (for example, a way to run two separate Twisted reactors), or whether there is a better way to structure this project.

Answer


It is possible to run multiple spiders within a single reactor by keeping the reactor open until all of the spiders have stopped running. This is achieved by keeping a list of all running spiders and not executing reactor.stop() until that list is empty:

import sys 
import os 
from scrapy.utils.project import get_project_settings 
from scrapy_somesite.spiders.create_urls_spider import Spider1 
from scrapy_somesite.spiders.crawl_urls_spider import Spider2 

from scrapy import signals, log 
from twisted.internet import reactor 
from scrapy.crawler import Crawler 

class CrawlRunner:

    def __init__(self):
        self.running_crawlers = []

    def spider_closing(self, spider):
        log.msg("Spider closed: %s" % spider, level=log.INFO)
        self.running_crawlers.remove(spider)
        if not self.running_crawlers:
            reactor.stop()

    def run(self):
        sys.path.append(os.path.join(os.path.curdir, "crawl/somesite"))
        os.environ['SCRAPY_SETTINGS_MODULE'] = 'scrapy_somesite.settings'
        settings = get_project_settings()
        log.start(loglevel=log.DEBUG)

        to_crawl = [Spider1, Spider2]

        for spider in to_crawl:
            crawler = Crawler(settings)
            crawler_obj = spider()
            self.running_crawlers.append(crawler_obj)

            crawler.signals.connect(self.spider_closing, signal=signals.spider_closed)
            crawler.configure()
            crawler.crawl(crawler_obj)
            crawler.start()

        reactor.run()

It is invoked from the main module as follows:

from crawl.somesite.crawl import CrawlRunner 

cr = CrawlRunner() 
cr.run() 

This solution is based on a blogpost by Kiran Koduru.
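
As an aside, later Scrapy releases (1.0+) wrap this pattern in scrapy.crawler.CrawlerProcess, which creates and stops the reactor itself. A rough sketch of the same two-spider run using it, assuming the same spider classes and settings module as above:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy_somesite.spiders.create_urls_spider import Spider1
from scrapy_somesite.spiders.crawl_urls_spider import Spider2

process = CrawlerProcess(get_project_settings())
process.crawl(Spider1)  # queue the first spider
process.crawl(Spider2)  # queue the second spider
process.start()         # starts the reactor and blocks until both spiders finish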


Is there a way to add crawlers to the reactor while it is running? How would that work, given that reactor.run() blocks?


Thanks for the credit :)