2012-01-31

Can someone check whether the code below is correct? It is the spider/exporter example from the Scrapy documentation at http://readthedocs.org/docs/scrapy/en/0.14/topics/exporters.html, which I suspect may be wrong.

The reason I think it is incorrect:

  • The class keeps a dict of files so it can track multiple simultaneously open spiders, but:
  • the exporter (which depends on the file) is overwritten each time a new spider is opened.

Thanks for any help.

from scrapy import signals
from scrapy.contrib.exporter import XmlItemExporter
from scrapy.xlib.pydispatch import dispatcher


class XmlExportPipeline(object):

    def __init__(self):
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.files = {}

    def spider_opened(self, spider):
        file = open('%s_products.xml' % spider.name, 'w+b')
        self.files[spider] = file
        # Suspected bug: a single attribute, overwritten every time
        # another spider opens
        self.exporter = XmlItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
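The suspected overwrite can be demonstrated without Scrapy at all. The sketch below uses hypothetical stand-ins (`StubExporter`, an in-memory file) for the real classes; it only mimics the shape of `spider_opened` to show that a single `self.exporter` attribute cannot track two open spiders at once:

```python
import io

class StubExporter:
    """Hypothetical stand-in for XmlItemExporter: just remembers its file."""
    def __init__(self, file):
        self.file = file

class BuggyPipeline:
    def __init__(self):
        self.files = {}

    def spider_opened(self, spider_name):
        f = io.BytesIO()                 # stand-in for the real output file
        self.files[spider_name] = f
        self.exporter = StubExporter(f)  # overwritten on every open

pipe = BuggyPipeline()
pipe.spider_opened("spider_a")
pipe.spider_opened("spider_b")

# Two files are tracked, but only spider_b's exporter survives:
print(len(pipe.files))                               # 2
print(pipe.exporter.file is pipe.files["spider_b"])  # True
print(pipe.exporter.file is pipe.files["spider_a"])  # False
```

After the second call, items from spider_a would be written through spider_b's exporter, which is exactly the problem the question describes.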

Answer


I think this question would be better asked on the scrapy-users group.

AFAIK, as of v0.14 Scrapy does not support running multiple spiders in a single process (related discussion), so this code works fine in practice. The obvious fix for the multi-spider case is to keep a dict of exporters keyed by spider, just like the files dict:

from scrapy import signals
from scrapy.contrib.exporter import XmlItemExporter
from scrapy.xlib.pydispatch import dispatcher


class XmlExportPipeline(object):

    def __init__(self):
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.files = {}
        self.exporters = {}

    def spider_opened(self, spider):
        file = open('%s_products.xml' % spider.name, 'w+b')
        self.files[spider] = file
        # One exporter per spider, so nothing is overwritten
        self.exporters[spider] = XmlItemExporter(file)
        self.exporters[spider].start_exporting()

    def spider_closed(self, spider):
        self.exporters.pop(spider).finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporters[spider].export_item(item)
        return item

Thanks, that's enlightening – mskel 2012-02-05 05:55:15