Asked 2015-02-24

Passing arguments to a Scrapy spider from within a Python script

I can crawl from a Python script by running the following snippet from the wiki:

from twisted.internet import reactor 
from scrapy.crawler import Crawler 
from scrapy import log, signals 
from testspiders.spiders.followall import FollowAllSpider 
from scrapy.utils.project import get_project_settings 

spider = FollowAllSpider(domain='scrapinghub.com') 
settings = get_project_settings() 
crawler = Crawler(settings) 
crawler.signals.connect(reactor.stop, signal=signals.spider_closed) 
crawler.configure() 
crawler.crawl(spider) 
crawler.start() 
log.start() 
reactor.run() 

As you can see, I can only pass domain to FollowAllSpider. My question is: how can I pass start_urls (actually, a date that will be appended to a fixed URL) to my spider class using the code above?

This is my spider class:

class MySpider(CrawlSpider):
    name = 'tw'

    def __init__(self, date):
        y, m, d = date.split('-')  # this is a test, it could split with a regex!
        try:
            y, m, d = int(y), int(m), int(d)
        except ValueError:
            raise ValueError('Enter a valid date')

        self.allowed_domains = ['mydomin.com']
        self.start_urls = ['my_start_urls{}-{}-{}'.format(y, m, d)]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="result-link"]/span/a/@href')
        for question in questions:
            item = PoptopItem()
            item['url'] = question.extract()
            yield item['url']

And this is my script:

from pdfcreator import convertor 
from twisted.internet import reactor 
from scrapy.crawler import Crawler 
from scrapy import log, signals 
#from testspiders.spiders.followall import FollowAllSpider 
from scrapy.utils.project import get_project_settings 
from poptop.spiders.stackoverflow_spider import MySpider 
from poptop.items import PoptopItem 

settings = get_project_settings() 
crawler = Crawler(settings) 
crawler.signals.connect(reactor.stop, signal=signals.spider_closed) 
crawler.configure() 

date=raw_input('Enter the date with this format (d-m-Y) : ') 
print date 
spider=MySpider(date=date) 
crawler.crawl(spider) 
crawler.start() 
log.start() 
item=PoptopItem() 

for url in item['url']: 
    convertor(url) 

reactor.run() # the script will block here until the spider_closed signal was sent 

And if I just print item, I get the following error:

2015-02-25 17:13:47+0330 [tw] ERROR: Spider must return Request, BaseItem or None, got 'unicode' in <GET test-link2015-1-17> 

My items:

import scrapy 


class PoptopItem(scrapy.Item):
    titles = scrapy.Field()
    content = scrapy.Field()
    url = scrapy.Field()

Answer


You need to modify your __init__() constructor to accept the date argument. Also, I would use datetime.strptime() to parse the date string:

from datetime import datetime

class MySpider(CrawlSpider):
    name = 'tw'
    allowed_domains = ['test.com']

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)

        date = kwargs.get('date')
        if not date:
            raise ValueError('No date given')

        dt = datetime.strptime(date, "%m-%d-%Y")
        self.start_urls = ['http://test.com/{dt.year}-{dt.month}-{dt.day}'.format(dt=dt)]

Then, you would instantiate the spider like this:

spider = MySpider(date='01-01-2015') 
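As a quick, standalone sanity check of the date parsing and URL construction above (pure stdlib, no Scrapy needed; 'test.com' is the placeholder host from the answer, and build_start_url is a hypothetical helper name):

```python
from datetime import datetime

def build_start_url(date_str, fmt="%m-%d-%Y"):
    # strptime() raises ValueError on malformed input, which replaces
    # the manual split/int checks from the original spider
    dt = datetime.strptime(date_str, fmt)
    return 'http://test.com/{dt.year}-{dt.month}-{dt.day}'.format(dt=dt)

print(build_start_url('01-17-2015'))  # http://test.com/2015-1-17
```

Note that the month and day are not zero-padded in the resulting URL, matching the `2015-1-17` seen in the error message above.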

Alternatively, you can avoid parsing the date altogether by passing a datetime instance in the first place:

class MySpider(CrawlSpider):
    name = 'tw'
    allowed_domains = ['test.com']

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)

        dt = kwargs.get('dt')
        if not dt:
            raise ValueError('No date given')

        self.start_urls = ['http://test.com/{dt.year}-{dt.month}-{dt.day}'.format(dt=dt)]

spider = MySpider(dt=datetime(year=2014, month=1, day=1))

Also, FYI, see this answer for a detailed example of how to run Scrapy from a script.
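For completeness: when running the spider through the scrapy command-line tool instead of a standalone script, the same keyword arguments reach __init__() via the -a option ('tw' is the spider name defined above):

```shell
# equivalent to MySpider(date='01-01-2015') from a script
scrapy crawl tw -a date=01-01-2015
```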


Thank you very much for the explanation! As I said, the date parser was just a test. Thanks also for the link suggestion. Now, as you can see, my 'parse' function yields 'url'; how can I get it (after the crawl)? – Kasramvd 2015-02-25 12:53:01


I used the item, but it raises a KeyError; it seems the crawl doesn't run! 'for url in item['url']:' – Kasramvd 2015-02-25 13:01:14


@KasraAD我认为你只需要'yield item'而不是'yield item ['url']''。让我知道它是否有帮助。 – alecxe 2015-02-25 13:15:28