How can I get the results of a running Scrapy crawler into a variable in Python?

I have about 11M URLs that need to be parsed and have data extracted from them. I want to do this with a Scrapy crawler. I created an architecture that has a code file, start_script.py:

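import csv
import os

def main():
    spider_name = 'example'
    filename = 'output.csv'  # placeholder: the original post never defines filename
    # Python 3: open CSV files in text mode with newline=''
    with open('file.csv', newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        for row in reader:
            link = row[0]  # csv.reader yields a list per row; the link is assumed to be in the first column
            # Starts a separate scrapy process for every single link
            os.system('scrapy crawl %s -a link=%s -o %s -t csv' % (spider_name, link, filename))

if __name__ == '__main__':
    main()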

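and a Scrapy crawler.

I have to store the parsed data, plus some additional information, in a file, and the most efficient way to do that is to keep the file open and keep writing to it.

So, is there any way to get the Scrapy results into a variable inside start_script.py? Or is there some other way to do this with Scrapy?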
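One possible alternative to launching a separate `scrapy crawl` process per link is to read the links lazily and hand them all to a single spider run. Below is a minimal sketch of the reading side only, using just the standard library. The helper name `iter_links` and the assumption that each link sits in the first CSV column are illustrative and do not come from the original post.

```python
import csv


def iter_links(csv_path):
    """Yield one URL at a time from the first column of a CSV file.

    Assumption: every non-empty row holds a link in its first column.
    Reading lazily keeps memory use flat even with millions of rows.
    """
    with open(csv_path, newline='') as csvfile:
        for row in csv.reader(csvfile, delimiter=','):
            if row and row[0].strip():  # skip blank lines and empty cells
                yield row[0].strip()
```

Inside a spider, a generator like this could drive the method that produces the initial requests (for example `start_requests` in Scrapy versions that use it), yielding one `scrapy.Request(url)` per link so that a single crawl handles the whole file.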

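I tried reading the Scrapy documentation (http://doc.scrapy.org/). I searched StackOverflow for answers and related questions (https://stackoverflow.com/questions/ask/advice?). And, of course, I tried my usual approach of looking for an answer on Google (https://www.google.com).

As you might guess, I found nothing!

Any answers, comments and ideas would be useful. Please keep in mind that I need to use Scrapy, or else be 100% sure that it can't be done.

Source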

2015-08-15 Dmitriy Chasovskoy

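It depends on the situation. Do all the sites have the same structure? Will you put all the logic into one spider? Or one spider per site? Please be specific about your task. – Vasim

I want to scrape a single site; all of the links come from the same site. –

OK, then why don't you use an item pipeline? Under your Scrapy project you have to write the spider definition anyway, right? Just use an item pipeline to write your scraped results to a CSV or JSON file. – Vasim

What have you tried so far? Post the source code of your spider definition.

OK. Under your Scrapy project there is a Python file named "pipelines.py". Append the following code to that file: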

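import csv

class myExporter(object):

    def __init__(self):
        # Python 3: text mode with newline='' (the original used 'wb', the Python 2 idiom)
        self.myCSV = csv.writer(open('filename.csv', 'w', newline=''))
        # '...' stands for the remaining field names
        self.myCSV.writerow(['field1', 'field2',...])

    def process_item(self, item, spider):
        self.myCSV.writerow([item['field1'], item['field2'], item['field3'],...])
        # Return the item so later pipelines still receive it
        return item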

Now open the settings.py file and add the following line:

ITEM_PIPELINES = ['your_Project_Name.pipelines.myExporter']

Hope this works!!! :)

Source

2015-08-17 13:04:55 Vasim

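I like this answer, but following the other post I would recommend using ITEM_PIPELINES = {'PROJECTNAME.pipelines.YourPipeline': 300}. Unfortunately, it still isn't an answer to my problem, because I would have to close and reopen the file every time. What I really want is an open file that I can write data to: not only the results of the requests but other specific data as well. I need something like a shell around the file to control the process. There are a lot of tricky things in an architecture like this, so if you still have any ideas, I'd be glad to hear them. Thanks. –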
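A rough sketch of how a pipeline might keep one file open for the entire crawl, assuming the `open_spider` / `close_spider` / `process_item` hooks of Scrapy's item pipeline interface (check them against your installed version). The class name `PersistentCsvPipeline`, the `path` parameter and the `write_extra` helper are hypothetical names introduced here for illustration. The class needs no Scrapy import because Scrapy only looks for these method names.

```python
import csv


class PersistentCsvPipeline:
    """Hypothetical pipeline that opens its CSV file once per crawl."""

    def __init__(self, path='filename.csv'):
        self.path = path  # assumption: output path, defaults to filename.csv

    def open_spider(self, spider):
        # Runs once when the spider starts: open the file and write the header.
        self.file = open(self.path, 'w', newline='')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['field1', 'field2', 'extradata'])

    def close_spider(self, spider):
        # Runs once when the spider finishes: the only place the file is closed.
        self.file.close()

    def write_extra(self, *values):
        # Hypothetical helper for rows that do not come from a scraped item.
        self.writer.writerow(values)

    def process_item(self, item, spider):
        # Runs for every item; the file stays open between calls.
        self.writer.writerow(
            [item.get('field1'), item.get('field2'), item.get('extradata')]
        )
        return item
```

With this shape the file is opened and closed exactly once per crawl, and any additional data can travel either as an extra item field (such as `extradata`) or through a helper like `write_extra`.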

Use CsvItemExporter to export the data to CSV easily.

In the items.py file:

import scrapy

class YourItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    field1 = scrapy.Field()
    field2 = scrapy.Field()
    extradata = scrapy.Field()

In the pipelines.py file:

from scrapy import signals
from scrapy.exporters import CsvItemExporter
from scrapy.exceptions import DropItem


class YourPipeline(object):
    def __init__(self):
        # Initialize PIPELINE
        self.files = {}
        self.ids_seen = set()  # a set gives fast membership tests

    @classmethod
    def from_crawler(cls, crawler):
        # Hook the pipeline methods up to the crawler's signals
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        crawler.signals.connect(pipeline.spider_error, signals.spider_error)
        crawler.signals.connect(pipeline.item_dropped, signals.item_dropped)
        return pipeline

    def item_dropped(self, item, response, exception, spider):
        # ITEM dropped from pipeline
        pass

    def spider_error(self, failure, response, spider):
        # SPIDER encountered error
        pass

    def spider_opened(self, spider):
        # SPIDER opened: the file is opened once and stays open for the whole crawl
        file = open('filename.csv', 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        # SPIDER closed
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        # Process ITEM: drop duplicates, export everything else
        if item['UNIQUEIDOFYOURCHOICE'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(item['UNIQUEIDOFYOURCHOICE'])
        self.exporter.export_item(item)
        return item

Activate it in settings.py:

ITEM_PIPELINES = {'PROJECTNAME.pipelines.YourPipeline': 300,}

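The pipeline above exports to the specified CSV file, with the item field names as the header row, and drops duplicates based on whatever unique key you choose to specify (http://doc.scrapy.org/en/latest/topics/exporters.html?highlight=csvitemexporter).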
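To make the "header plus de-duplication" behaviour concrete, here is a small standalone sketch that mimics it with the standard library only. It is not the Scrapy exporter itself; the function name `export_unique` and the sample records are made up for illustration.

```python
import csv
import io


def export_unique(items, key, out):
    """Write dict-like items to `out` as CSV, skipping repeated `key` values."""
    seen = set()
    writer = None
    for item in items:
        if item[key] in seen:
            continue  # the Scrapy pipeline raises DropItem at this point
        seen.add(item[key])
        if writer is None:
            # Field names of the first item become the header row
            writer = csv.DictWriter(out, fieldnames=list(item))
            writer.writeheader()
        writer.writerow(item)
    return len(seen)


buf = io.StringIO()
count = export_unique(
    [{'id': 1, 'name': 'a'}, {'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}],
    key='id',
    out=buf,
)
print(buf.getvalue())
```

The second record shares its `id` with the first, so only two data rows follow the header.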

Send data from the SPIDER to the pipeline:

from items import YourItem

def parse(self, response):  # any spider callback; shown here so that yield is valid
    for VALUE in SETOFVALUES:
        item = YourItem()
        item['field1'] = 'SOME VALUE'
        item['field2'] = 'SOME VALUE2'
        item['extradata'] = 'SOMEEXTRADATA'
        # yield WILL SEND THE ITEM WITH DATA CURRENTLY ASSIGNED TO IT TO PIPELINE
        yield item

Source

2015-08-17 13:57:00
