2017-07-30

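Scrapy CrawlSpider - unable to follow specific links or use a custom parser

I have been using Scrapy and trying to follow the examples so that the spider only follows URLs matching a certain regular expression.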

I'm not a Python developer, but I have tried many approaches to work out what is going on.

I used the example URL from the Scrapy documentation, subclassed CrawlSpider, and implemented the rules with LinkExtractor.

For now, I only want to use a custom parser for URLs that contain the word "friends".

**Scrapy Python spider**

import scrapy

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example'

    allowed_domains = ['quotes.toscrape.com']

    start_urls = ['http://quotes.toscrape.com']

    rules = [
        Rule(LinkExtractor(allow='(friends)'), callback='parse_custom')
    ]

    def parse(self, response):
        self.logger.info('1111111111111 - Parsing General URL! %s', response.url)

        for href in response.css('a::attr(href)'):
            yield response.follow(href, callback=self.parse)

    def parse_custom(self, response):
        # I have never been able to get this to call
        self.logger.info('2222222222222 - Parsing CUSTOM URL! %s', response.url)

        for href in response.css('a::attr(href)'):
            yield response.follow(href, callback=self.parse)

Log file

2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/miracles/page/1/> (referer: http://quotes.toscrape.com) 
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/miracle/page/1/> (referer: http://quotes.toscrape.com) 
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/live/page/1/> (referer: http://quotes.toscrape.com) 
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/life/page/1/> (referer: http://quotes.toscrape.com) 
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/inspirational/page/1/> (referer: http://quotes.toscrape.com) 
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/choices/page/1/> (referer: http://quotes.toscrape.com) 
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/abilities/page/1/> (referer: http://quotes.toscrape.com) 
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/simile/> (referer: http://quotes.toscrape.com) 
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/miracles/page/1/ 
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/miracle/page/1/ 
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/live/page/1/ 
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/life/page/1/ 
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/inspirational/page/1/ 
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/choices/page/1/ 
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/abilities/page/1/ 
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/simile/ 
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/truth/> (referer: http://quotes.toscrape.com) 
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Marilyn-Monroe/> (referer: http://quotes.toscrape.com) 
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/friends/> (referer: http://quotes.toscrape.com) 
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/friendship/> (referer: http://quotes.toscrape.com) 
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/reading/> (referer: http://quotes.toscrape.com) 
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/books/> (referer: http://quotes.toscrape.com) 
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/humor/> (referer: http://quotes.toscrape.com) 
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Jane-Austen/> (referer: http://quotes.toscrape.com) 
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/truth/ 
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/author/Marilyn-Monroe/ 
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/friends/ 
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/friendship/ 
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/reading/ 
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/books/ 
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/humor/ 
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/author/Jane-Austen/ 
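
Two of the URLs in the log, `/tag/friends/` and `/tag/friendship/`, both contain the substring `friends`, so the pattern in the rule would likely select both once the rule is actually applied. The check below is a minimal sketch using only the standard library. It assumes that LinkExtractor treats `allow` as a regular expression searched anywhere in the absolute URL. The helper `matches_allow` and the stricter pattern `STRICT` are hypothetical names introduced here for illustration.

```python
import re

# Pattern copied from the rule in the question.
ALLOW = re.compile(r'(friends)')

# A stricter, illustrative pattern (an assumption, not from the question)
# that only accepts the exact tag page.
STRICT = re.compile(r'/tag/friends/$')

# URLs taken from the crawl log.
urls = [
    'http://quotes.toscrape.com/tag/friends/',
    'http://quotes.toscrape.com/tag/friendship/',
    'http://quotes.toscrape.com/tag/truth/',
]


def matches_allow(url, pattern=ALLOW):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


loose_matches = [u for u in urls if matches_allow(u)]
strict_matches = [u for u in urls if matches_allow(u, STRICT)]

print(loose_matches)   # includes the friendship page as well
print(strict_matches)  # only the friends tag page
```

If only the `friends` tag page should reach the custom parser, a tighter pattern along the lines of `STRICT` may be the better choice.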

Answer


From the documentation:

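When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.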


Thanks, I even partly remember reading this when I went through the documentation.