
Scrapy: rules set in __init__ are ignored by CrawlSpider. I have been stuck on this for days and it is driving me crazy.

I call my Scrapy spider like this:

scrapy crawl example -a follow_links="True" 

I pass in the follow_links flag to decide whether the entire website should be scraped, or only the index page I have defined in the spider.

The flag is checked in the spider's constructor to decide which rules to set:

def __init__(self, *args, **kwargs):

    super(ExampleSpider, self).__init__(*args, **kwargs)

    self.follow_links = kwargs.get('follow_links')
    if self.follow_links == "True":
        self.rules = (
            Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
        )
    else:
        self.rules = (
            Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
        )

If the flag is "True", all links are allowed; if it is "False", all links are denied.
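
To make that concrete, here is a minimal sketch of what the two extractors match, using a made-up page body and URL purely for illustration: allow=() places no restriction, while the deny pattern matches any URL that contains an alphanumeric character, so it rejects effectively every link.

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# hypothetical page with a single internal link, used only to exercise the extractors
body = b'<html><body><a href="http://www.example.com/page1">page 1</a></body></html>'
response = HtmlResponse(url="http://www.example.com", body=body, encoding="utf-8")

# allow=() imposes no restriction, so the link is extracted
print(LinkExtractor(allow=()).extract_links(response))

# deny=r'[a-zA-Z0-9]*' matches every URL, so nothing is extracted
print(LinkExtractor(deny=(r'[a-zA-Z0-9]*',)).extract_links(response))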

So far, so good, but the rules are ignored. The only way I can get the rules to be obeyed is to define them outside the constructor. That means something like this works as expected:

class ExampleSpider(CrawlSpider):

    rules = (
        Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
    )

    def __init__(self, *args, **kwargs):
        ...

So, basically, defining the rules inside the __init__ constructor causes them to be ignored, whereas defining the rules outside the constructor works as expected.

I cannot understand this. My code is below.

import re
import scrapy

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.html import remove_tags, remove_comments, replace_escape_chars, replace_entities, remove_tags_with_content


class ExampleSpider(CrawlSpider):

    name = "example"
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']
    # if the rule below is uncommented, it works as expected (i.e. follow links and call parse_pages)
    # rules = (
    #     Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
    # )

    def __init__(self, *args, **kwargs):

        super(ExampleSpider, self).__init__(*args, **kwargs)

        # single page or follow links
        self.follow_links = kwargs.get('follow_links')
        if self.follow_links == "True":
            # the rule below will always be ignored (why?!)
            self.rules = (
                Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
            )
        else:
            # the rule below will always be ignored (why?!)
            self.rules = (
                Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
            )

    def parse_pages(self, response):
        print("In parse_pages")
        print(response.xpath('/html/body').extract())
        return None

    def parse_start_url(self, response):
        print("In parse_start_url")
        print(response.xpath('/html/body').extract())
        return None

Thank you for taking the time to help me with this.


You could try setting your rules before the call to super(ExampleSpider, ... – eLRuLL


@eLRuLL, you should post this as an answer –

Answer


The problem here is that the CrawlSpider constructor (__init__) also processes the rules argument, so if you need to assign them, you have to do it before calling the default constructor.
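
For context, CrawlSpider's own constructor compiles whatever is in self.rules at the moment it runs, roughly along these lines (a loose paraphrase of the Scrapy source, shown only to illustrate the ordering, not runnable on its own):

from scrapy.spiders import Spider

# rough paraphrase of scrapy.spiders.CrawlSpider.__init__, for illustration only
class CrawlSpider(Spider):
    rules = ()

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()  # snapshots self.rules; rules assigned afterwards are never seen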

In other words, do everything you need before you call super(ExampleSpider, self).__init__(*args, **kwargs):

def __init__(self, *args, **kwargs):
    # set self.rules here, before the parent constructor runs
    super(ExampleSpider, self).__init__(*args, **kwargs)
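
Applied to the spider from the question, the reordered constructor would look roughly like this (a sketch using the same rules and callback names as the question, only with the assignment moved before the parent constructor call):

def __init__(self, *args, **kwargs):
    # decide on the rules first, while CrawlSpider.__init__ has not run yet
    self.follow_links = kwargs.get('follow_links')
    if self.follow_links == "True":
        self.rules = (
            Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
        )
    else:
        self.rules = (
            Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
        )
    # now CrawlSpider.__init__ compiles self.rules into its internal form
    super(ExampleSpider, self).__init__(*args, **kwargs)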

That's it, you have to set your rules, start_urls and allowed_domains before calling super(). –