
I want to scrape a website that consists of two sections, and my script is not as fast as I need it to be. Is it possible to run multiple spiders in parallel for a single website in Scrapy?

Is it possible to launch two spiders, one to scrape the first section and another for the second?

I thought of having two different spider classes and running them with

scrapy crawl firstSpider 
scrapy crawl secondSpider 

but I think that is not a smart way to do it.

I read the documentation of scrapyd, but I don't know if it fits my case.

Answers


I think what you are looking for is something like this:

import scrapy 
from scrapy.crawler import CrawlerProcess 

class MySpider1(scrapy.Spider): 
    # Your first spider definition 
    ... 

class MySpider2(scrapy.Spider): 
    # Your second spider definition 
    ... 

process = CrawlerProcess() 
process.crawl(MySpider1) 
process.crawl(MySpider2) 
process.start() # the script will block here until all crawling jobs are finished 

You can read more about it at: running-multiple-spiders-in-the-same-process
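
For completeness, here is a rough sketch of what those two spider definitions could look like; the spider names, section URLs and parsing logic below are purely hypothetical placeholders:

import scrapy 

class MySpider1(scrapy.Spider): 
    name = "first_section" 
    # hypothetical URL for the first part of the site 
    start_urls = ["https://example.com/section-1"] 

    def parse(self, response): 
        yield {"title": response.css("title::text").extract_first()} 

class MySpider2(scrapy.Spider): 
    name = "second_section" 
    # hypothetical URL for the second part of the site 
    start_urls = ["https://example.com/section-2"] 

    def parse(self, response): 
        yield {"title": response.css("title::text").extract_first()} 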


Thanks man, that is exactly what I needed – parik


Or you can run it like this; you need to save this code in the same directory as scrapy.cfg (my Scrapy version is 1.3.3):

from scrapy.utils.project import get_project_settings 
from scrapy.crawler import CrawlerProcess 

setting = get_project_settings() 
process = CrawlerProcess(setting) 

for spider_name in process.spiders.list(): 
    print ("Running spider %s" % (spider_name)) 
    process.crawl(spider_name, query="dvh")  # "query" is a custom argument passed on to your spider 

process.start() 
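
As a side note, a keyword argument passed to process.crawl(), such as query above, is handed to the spider's constructor, so a spider that uses it might look roughly like this (the spider name and URL here are just illustrative):

import scrapy 

class DvhSpider(scrapy.Spider): 
    name = "dvh_spider"  # illustrative name 

    def __init__(self, query=None, *args, **kwargs): 
        super(DvhSpider, self).__init__(*args, **kwargs) 
        # value passed via process.crawl(..., query="dvh") 
        self.query = query 

    def start_requests(self): 
        # hypothetical search URL built from the custom argument 
        yield scrapy.Request("https://example.com/search?q=%s" % self.query) 

    def parse(self, response): 
        yield {"query": self.query, "url": response.url} 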

It works too, thanks – parik


[twisted] CRITICAL: Unhandled error in Deferred: – zhilevan


A better approach (if you have several spiders) is to fetch the spiders dynamically and run them:

from scrapy import spiderloader 
from scrapy.crawler import CrawlerRunner 
from scrapy.utils import project 
from scrapy.utils.log import configure_logging 
from twisted.internet import reactor 
from twisted.internet.defer import inlineCallbacks 

configure_logging() 
settings = project.get_project_settings() 
runner = CrawlerRunner(settings) 


@inlineCallbacks 
def crawl(): 
    spider_loader = spiderloader.SpiderLoader.from_settings(settings) 
    spiders = spider_loader.list() 
    classes = [spider_loader.load(name) for name in spiders] 
    for my_spider in classes: 
        yield runner.crawl(my_spider)  # wait for each crawl to finish before starting the next 
    reactor.stop() 

crawl() 
reactor.run() 
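
Note that the yield in the loop above makes the spiders run one after another. If the goal is to run them all concurrently, a variant along the lines of the CrawlerRunner example in the Scrapy documentation can schedule every crawl first and stop the reactor once they have all finished (again a sketch, not tested against your project):

from scrapy import spiderloader 
from scrapy.crawler import CrawlerRunner 
from scrapy.utils import project 
from scrapy.utils.log import configure_logging 
from twisted.internet import reactor 

configure_logging() 
settings = project.get_project_settings() 
runner = CrawlerRunner(settings) 
spider_loader = spiderloader.SpiderLoader.from_settings(settings) 

# crawl() only schedules the spiders; nothing runs until reactor.run() 
for spider_name in spider_loader.list(): 
    runner.crawl(spider_loader.load(spider_name)) 

# stop the reactor once all scheduled crawls have finished 
d = runner.join() 
d.addBoth(lambda _: reactor.stop()) 

reactor.run() 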

(Second solution): since spiders.list() is deprecated in Scrapy 1.4, Yuda's solution above should be converted to something like this:

from scrapy import spiderloader 
from scrapy.utils.project import get_project_settings 
from scrapy.crawler import CrawlerProcess 

settings = get_project_settings() 
process = CrawlerProcess(settings) 
spider_loader = spiderloader.SpiderLoader.from_settings(settings) 

for spider_name in spider_loader.list(): 
    print("Running spider %s" % (spider_name)) 
    process.crawl(spider_name) 
process.start()