
Links not followed using Scrapy

I have created a spider extending CrawlSpider, following the advice at http://scrapy.readthedocs.org/en/latest/topics/spiders.html.

The problem is that I need to parse both the start URL (which happens to coincide with the hostname) and some of the links it contains.

So I defined a rule: rules = [Rule(SgmlLinkExtractor(allow=['/page/d+']), callback='parse_items', follow=True)], but nothing happens.

Then I tried defining a set of rules, like: rules = [Rule(SgmlLinkExtractor(allow=['/page/d+']), callback='parse_items', follow=True), Rule(SgmlLinkExtractor(allow=['/']), callback='parse_items', follow=True)]. The problem now is that the spider parses everything.

How can I tell the spider to parse the start_url as well as only some of the links it contains?
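For reference, CrawlSpider's documented hook for handling the start URLs themselves is parse_start_url, which the update below goes on to try; a minimal sketch of the idea (the class name here is just a placeholder, and the callback is the question's parse_items):

from scrapy.contrib.spiders import CrawlSpider

class MySpider(CrawlSpider):
    # rules = [...] as defined above

    def parse_start_url(self, response):
        # CrawlSpider calls this for every response generated from
        # start_urls, independently of the link-following rules.
        return self.parse_items(response)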

Update:

I tried overriding the parse_start_url method, so now I am able to get the data from the start page, but it still does not follow the links defined with the Rule:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from techCrunch.items import Article


class ExampleSpider(CrawlSpider):
    name = 'TechCrunchCrawler'
    start_urls = ['http://techcrunch.com']
    allowed_domains = ['techcrunch.com']
    rules = [Rule(SgmlLinkExtractor(allow=['/page/d+']), callback='parse_links', follow=True)]

    def parse_start_url(self, response):
        print '++++++++++++++++++++++++parse start url++++++++++++++++++++++++'
        return self.parse_links(response)

    def parse_links(self, response):
        print '++++++++++++++++++++++++parse link called++++++++++++++++++++++++'
        articles = []
        for i in HtmlXPathSelector(response).select('//h2[@class="headline"]/a'):
            article = Article()
            article['title'] = i.select('./@title').extract()
            article['link'] = i.select('./@href').extract()
            articles.append(article)
        return articles

You could post some of your code here to help identify the issue as well – 2012-07-10 09:36:28

Answers


I had a similar problem in the past, and I ended up sticking with BaseSpider.

Try this:

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 
from scrapy.http import Request 
from scrapy.contrib.loader import XPathItemLoader 

from techCrunch.items import Article 


class techCrunch(BaseSpider):
    name = 'techCrunchCrawler'
    allowed_domains = ['techcrunch.com']

    # This gets your start page and directs it to the parse manager
    def start_requests(self):
        return [Request("http://techcrunch.com", callback=self.parseMgr)]

    # The parse manager deals out what to parse and starts page extraction
    def parseMgr(self, response):
        print '++++++++++++++++++++++++parse start url++++++++++++++++++++++++'
        yield self.pageParser(response)

        nextPage = HtmlXPathSelector(response).select("//div[@class='page-next']/a/@href").extract()
        if nextPage:
            yield Request(nextPage[0], callback=self.parseMgr)

    # The page parser only parses the pages and returns items on each call
    def pageParser(self, response):
        print '++++++++++++++++++++++++parse link called++++++++++++++++++++++++'
        loader = XPathItemLoader(item=Article(), response=response)
        loader.add_xpath('title', '//h2[@class="headline"]/a/@title')
        loader.add_xpath('link', '//h2[@class="headline"]/a/@href')
        return loader.load_item()
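This assumes an Article item with title and link fields in techCrunch/items.py; a minimal sketch of what that module might look like (the field names are taken from the loader calls above):

from scrapy.item import Item, Field

class Article(Item):
    # Field names must match what the XPathItemLoader populates
    title = Field()
    link = Field()

With that in place, the spider can be run as usual with scrapy crawl techCrunchCrawler.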

You forgot to backslash-escape the letter d as \d:

>>> SgmlLinkExtractor(allow=r'/page/d+').extract_links(response) 
[] 
>>> SgmlLinkExtractor(allow=r'/page/\d+').extract_links(response) 
[Link(url='http://techcrunch.com/page/2/', text=u'Next Page',...)]
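Applied to the original spider, the corrected rule would read as follows (a sketch keeping the question's callback name); together with the parse_start_url override from the update, this covers both the start page and the pagination pages:

rules = [Rule(SgmlLinkExtractor(allow=[r'/page/\d+']), callback='parse_links', follow=True)]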