
Python Scrapy - unable to crawl

I want to scrape some websites using Scrapy. Below is a sample of the code; the parse method is never called. I am trying to run the code through a reactor service (code provided), so I launch it from startCrawling.py, which owns the reactor. I know I am missing something. Can you help?

Thanks,

Code - categorization.py

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from scrapy.selector import Selector
from scrapy.selector import HtmlXPathSelector
from items.items import CategorizationItem
from scrapy.contrib.spiders.crawl import CrawlSpider

class TestingSpider(CrawlSpider):
    print 'in spider'
    name = 'testSpider'
    allowed_domains = ['wikipedia.org']
    start_urls = ['http://www.wikipedia.org']

    def parse(self, response):
        # Scrape data from page
        print 'here'
        open('test.html', 'wb').write(response.body)

Code - startCrawling.py

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.utils.project import get_project_settings

from spiders.categorization import TestingSpider

# Script that runs the spider inside the Twisted reactor.

def stop_reactor():
    reactor.stop()  # stop() must be called, not the bare attribute reactor.stop
    print 'hi'

# Stop the reactor once the spider closes, then run the crawl.
dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = TestingSpider()
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
reactor.run()

Answer


You should not override the parse() method when using CrawlSpider. Instead, set a custom callback with a different name in a Rule.
Here is an excerpt from the official documentation:

When writing crawl spider rules, avoid using parse as a callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
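For example, here is a minimal sketch of the spider rewritten that way; the Rule pattern and the callback name parse_page are illustrative choices, not from the original post:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TestingSpider(CrawlSpider):
    name = 'testSpider'
    allowed_domains = ['wikipedia.org']
    start_urls = ['http://www.wikipedia.org']

    # Any callback name other than "parse" works; CrawlSpider reserves
    # parse() for its own link-following logic.
    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        print 'here'
        with open('test.html', 'wb') as f:
            f.write(response.body)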


Thanks. I am accepting the answer. I will try this and let you know. – user1930402 2014-12-05 09:26:20


Fastest accept ever, I just clicked and it turned green :) – bosnjak 2014-12-05 09:26:46