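Scrapy: store returned items in a variable to use in the main script

I'm quite new to Scrapy and want to try the following: extract some values from a web page, store them in a variable, and use that variable in my main script. So I followed their tutorial and changed the code for my purposes: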

import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]

    custom_settings = {
        'LOG_ENABLED': False,
    }

    def parse(self, response):
        global title  # This would work, but there should be a better way
        title = response.css('title::text').extract_first()


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(QuotesSpider)
process.start()  # the script will block here until the crawling is finished

print(title)  # Verify if it works and do some other actions later on...

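This works so far, but I'm sure it isn't good style and may even have bad side effects if I define the title variable as global. If I skip that line, I get an "undefined variable" error, of course :/ So I'm looking for a way to return the variable and use it in my main script.

I have read about item pipelines, but I couldn't get them to work.

Any help or ideas are greatly appreciated :) Thanks in advance!

Source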

2017-12-27 MaGi

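Better to use 'global', it will be easier. Pipelines won't help you. – furas

As you know, using global is not good style, especially when you need to extend the requirements.

My suggestion is to store the title in a file or a list and use it in the main process. If you want to process the title in another script, just open the file and read the title in that script. (Note: please ignore the indentation issues.)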

spider.py

import scrapy
from scrapy.crawler import CrawlerProcess

namefile = 'namefile.txt'
current_title_session = []  # titles stored during the current session

# Load titles saved by earlier runs; start empty if the file does not exist yet.
try:
    with open(namefile, 'r', encoding='utf-8') as f:
        title_in_file = f.readlines()
except FileNotFoundError:
    title_in_file = []

file_append = open(namefile, 'a', encoding='utf-8')


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]

    custom_settings = {
        'LOG_ENABLED': False,
    }

    def parse(self, response):
        title = response.css('title::text').extract_first()
        # Record the title only if it was not saved before.
        if title + '\n' not in title_in_file and title not in current_title_session:
            file_append.write(title + '\n')
            current_title_session.append(title)


if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(QuotesSpider)
    process.start()  # the script will block here until the crawling is finished
    file_append.close()  # make sure the collected titles are written to disk
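
The reading side of this approach could look roughly like the sketch below. It assumes the spider has already written one title per line to namefile.txt; the helper name read_titles is hypothetical and not part of the answer above.

```python
from pathlib import Path


def read_titles(path="namefile.txt"):
    """Return the titles stored one per line in `path` (hypothetical helper)."""
    file = Path(path)
    if not file.exists():
        # Nothing has been scraped yet.
        return []
    lines = file.read_text(encoding="utf-8").splitlines()
    # Drop empty lines and surrounding whitespace.
    return [line.strip() for line in lines if line.strip()]


if __name__ == "__main__":
    for title in read_titles():
        print(title)
```

Another script can import read_titles, or simply repeat these few lines, once the crawl has finished.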

Source

2017-12-29 03:55:54 AndyWang

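Thanks, this solves the problem with the global statement, although I'm not sure whether creating another file to handle it is elegant. Anyway, it works well for me :-) – MaGi

Making the variable global should work for what you need, but as you said, it isn't good style.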
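
One in-process alternative to a global, sketched here as an assumption rather than something proposed in this thread, is to have the spider yield the title as an item and collect it through Scrapy's item_scraped signal. The names ItemCollector, crawl_titles, and TitleSpider are hypothetical.

```python
class ItemCollector:
    """Keeps every item it receives (hypothetical helper class)."""

    def __init__(self):
        self.items = []

    def add(self, item, response=None, spider=None):
        # Intended as a receiver for Scrapy's item_scraped signal.
        self.items.append(item)


def crawl_titles(start_url):
    """Run one crawl and return the scraped items as a list."""
    # Imported here so ItemCollector stays usable without Scrapy installed.
    import scrapy
    from scrapy import signals
    from scrapy.crawler import CrawlerProcess

    class TitleSpider(scrapy.Spider):
        name = "title"
        start_urls = [start_url]
        custom_settings = {"LOG_ENABLED": False}

        def parse(self, response):
            # Yield the value as an item instead of assigning a global.
            yield {"title": response.css("title::text").get()}

    collector = ItemCollector()
    process = CrawlerProcess()
    crawler = process.create_crawler(TitleSpider)
    crawler.signals.connect(collector.add, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks until the crawl is finished
    return collector.items


if __name__ == "__main__":
    print(crawl_titles("http://quotes.toscrape.com/page/1/"))
```

The main script then works with the returned list instead of a module-level variable.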

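I would really recommend using a separate service for communication between processes, such as Redis, so you won't have conflicts between your spider and any other process.

It is very simple to set up and use, and the documentation has a very simple example.

Instantiate the Redis connection inside the spider and again in the main process (think of them as separate processes). The spider sets the variable and the main process reads (or gets) the information.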
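
A minimal sketch of that set/get exchange follows, assuming the redis-py package and a Redis server on localhost:6379. The key name quotes:title and the helpers store_title and load_title are hypothetical.

```python
TITLE_KEY = "quotes:title"  # hypothetical key name


def store_title(client, title):
    """Called from the spider side: save the scraped title."""
    client.set(TITLE_KEY, title)


def load_title(client):
    """Called from the main process: read the title back (None if unset)."""
    return client.get(TITLE_KEY)


if __name__ == "__main__":
    # Assumes `pip install redis` and a server running on localhost:6379.
    import redis

    # decode_responses=True makes get() return str instead of bytes.
    client = redis.Redis(host="localhost", port=6379, decode_responses=True)
    store_title(client, "Quotes to Scrape")  # would happen inside parse()
    print(load_title(client))                # would happen in the main script
```

In a real setup the spider and the main script would each create their own client, as the answer describes.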

Source

2017-12-27 14:46:31 eLRuLL

Thanks. In the short term I'll go with furas' and AndyWang's answers, but if I have time I'll read up on Redis :) – MaGi
