I'm building a scrapy crawler that should crawl an entire domain looking for broken EXTERNAL links. How can I find the external 404s?

I have the following:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class domainget(CrawlSpider):
    name = 'getdomains'
    allowed_domains = ['start.co.uk']
    start_urls = ['http://www.start.co.uk']

    rules = (
        Rule(LinkExtractor('/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Extract only links pointing outside the allowed domain
        for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            resp = scrapy.Request(link.url, callback=self.parse_ext)

    def parse_ext(self, response):
        self.logger.info('>>>>>>>>>> Reading: %s', response.url)

When I run this code, it never reaches the parse_ext() function, where I want to get the HTTP status code and do further processing based on it.

As you can see, I'm using parse_ext() as the callback when I loop over the links extracted from the page in the parse_item() function.

What am I doing wrong?

Answer


You are not returning the Request instances from the callback:

def parse_item(self, response):
    for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
        yield scrapy.Request(link.url, callback=self.parse_ext)

def parse_ext(self, response):
    self.logger.info('>>>>>>>>>> Reading: %s', response.url)
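
Even with the requests yielded, a 404 still won't reach parse_ext() by default: Scrapy's HttpError spider middleware filters out non-2xx responses before they hit the callback. A minimal sketch of one way to let them through and record the status, assuming the spider-level handle_httpstatus_list attribute (the status list and log messages are illustrative):

# Inside the spider class: allow these non-2xx responses through to the
# callbacks instead of having HttpErrorMiddleware discard them.
handle_httpstatus_list = [404, 410, 500, 503]

def parse_ext(self, response):
    if response.status >= 400:
        self.logger.warning('Broken external link (%d): %s',
                            response.status, response.url)
    else:
        self.logger.info('>>>>>>>>>> Reading: %s', response.url)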

Bingo! Additionally, I had to add dont_filter=True to the Request object, like so: yield scrapy.Request(link.url, callback=self.parse_ext, dont_filter=True)
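
For completeness, some broken links never return an HTTP status at all (DNS failures, timeouts, refused connections). Those can be caught with a request errback; below is a rough sketch under the same assumptions, combined with the dont_filter=True tweak from the comment above (errback_ext is a hypothetical helper name):

from scrapy.spidermiddlewares.httperror import HttpError

def parse_item(self, response):
    for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
        yield scrapy.Request(link.url, callback=self.parse_ext,
                             errback=self.errback_ext, dont_filter=True)

def errback_ext(self, failure):
    # Failures raised by HttpErrorMiddleware carry the response; other
    # failures (DNS lookup errors, timeouts) only carry the original request.
    if failure.check(HttpError):
        self.logger.warning('Broken external link (%d): %s',
                            failure.value.response.status,
                            failure.value.response.url)
    else:
        self.logger.warning('Request failed: %s', failure.request.url)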