
Creating editable CrawlSpider rules in Scrapy

I've been trying to create a simple Scrapy CrawlSpider script that can be changed easily, but I can't figure out how to get the link extractor rules to work properly.

Here is my code:

from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class LernaSpider(CrawlSpider):
    """Our ad-hoc spider"""

    name = "lerna"

    def __init__(self, url, allow_follow='.*', deny_follow='',
                 allow_extraction='.*', deny_extraction=''):
        parsed_url = urlparse(url)
        domain = str(parsed_url.netloc)
        self.allowed_domains = [domain]
        self.start_urls = [url]
        self.rules = (
            # Extract links and follow them
            # (no callback means follow=True by default).
            Rule(SgmlLinkExtractor(allow=(allow_follow,), deny=(deny_follow,))),

            # Extract links and parse them with the spider's parse_item method.
            Rule(SgmlLinkExtractor(allow=(allow_extraction,), deny=(deny_extraction,)), callback='parse_item'),
        )

        super(LernaSpider, self).__init__()

    def parse_item(self, response):
        print 'Crawling... %s' % response.url
        # more stuff here

I have this code, but I've never been able to get the allow/deny rules working properly, and I really don't know why. Does leaving a string empty make it deny everything? I thought that, since these are regexes, it would only do a blanket deny if I entered '.*' or something like that.

Any help would be appreciated.

Answer


Are you instantiating the spider yourself? Something like this:

spider = LernaSpider('http://example.com') 

Otherwise, if you are running $ scrapy crawl lerna from the command line, then you are using the URL incorrectly as the first argument to the constructor (it should be the name), and you are also not passing it on to super. Maybe try this:

class LernaSpider(CrawlSpider):
    """Our ad-hoc spider"""

    name = "lerna"

    def __init__(self, name=None, url=None, allow_follow='.*', deny_follow='',
                 allow_extraction='.*', deny_extraction='', **kw):
        parsed_url = urlparse(url)
        domain = str(parsed_url.netloc)
        self.allowed_domains = [domain]
        self.start_urls = [url]
        self.rules = (
            # Extract links and follow them
            # (no callback means follow=True by default).
            Rule(SgmlLinkExtractor(allow=allow_follow, deny=deny_follow)),

            # Extract links and parse them with the spider's parse_item method.
            Rule(SgmlLinkExtractor(allow=allow_extraction, deny=deny_extraction), callback='parse_item'),
        )
        super(LernaSpider, self).__init__(name, **kw)

    def parse_item(self, response):
        print 'Crawling... %s' % response.url
        # more stuff here
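
With that signature, a script-side instantiation could then pass everything by keyword. The URL and patterns below are hypothetical, purely for illustration:

spider = LernaSpider(url='http://example.com',
                     deny_follow='/logout',            # don't follow logout links
                     allow_extraction='/articles/.*')  # only parse article pages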

The regular expressions look fine: empty values allow everything and deny nothing.
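
Incidentally, you can sanity-check the patterns with plain Python, independent of Scrapy. One subtlety worth knowing: an empty pattern still matches every string (as a zero-length match), so what '' ends up meaning depends on whether the link extractor skips empty values before compiling them:

import re

url = 'http://example.com/page'

print bool(re.search('', url))    # True - an empty pattern matches everything
print bool(re.search('.*', url))  # True - as does '.*'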


Yes, I'm instantiating the spider myself in a script. The code is something like: crawler = CrawlerProcess(settings); spider = LernaSpider(url); crawler.crawl(spider). There's more to it than that, obviously, but that's the short version. – oiez 2013-03-22 23:11:40
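
For anyone reading later: in more recent Scrapy versions the script-driven run passes the spider class and its constructor kwargs to CrawlerProcess, rather than a pre-built instance. A minimal sketch, assuming the spider has been ported off the since-deprecated SgmlLinkExtractor:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()                             # a settings dict may be passed here
process.crawl(LernaSpider, url='http://example.com')   # kwargs are forwarded to __init__
process.start()                                        # blocks until the crawl finishes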


I just tried changing allow=(allow_extraction,) to allow=allow_extraction and it worked! Not 100% sure why, but thanks for giving me something to work with. :) – oiez 2013-03-22 23:30:09


@steven almeroth, can I change the rules after the crawl has started? Something like 'SpiderName.rules = new_rules'? – wolfgang 2015-08-13 09:05:09
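
A note on that last question: CrawlSpider compiles its rules once during __init__ (into a private _rules attribute), so reassigning spider.rules on a live spider has no visible effect by itself. A hedged sketch that leans on the private _compile_rules() helper, which may change between Scrapy versions:

spider.rules = new_rules
spider._compile_rules()  # rebuild the compiled copy that CrawlSpider actually uses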