2016-11-04 73 views
0

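Scrapy: crawl only internal links, including relative links

I need to use Scrapy to crawl all internal links of a website, so that, for example, every link on www.stackoverflow.com is crawled. This code sort of works: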

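    from scrapy.linkextractors import LinkExtractor

    # Keep only links whose domain matches the start domain
    extractor = LinkExtractor(allow_domains=self.getBase(self.startDomain))

    for link in extractor.extract_links(response):
        self.registerUrl(link.url)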

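However, there is one small problem: relative paths such as /meta are not crawled, because they do not contain the base domain stackoverflow.com. Any ideas on how to fix this?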

+1

Doesn't scrapy.spidermiddlewares.offsite.OffsiteMiddleware https://doc.scrapy.org/en/latest/topics/spider-middleware.html#scrapy.spidermiddlewares.offsite.OffsiteMiddleware do this?

+0

Thanks, I had apparently found some old documentation.

Answers

1

If I understand the question correctly, you want to use scrapy.spidermiddlewares.offsite.OffsiteMiddleware https://doc.scrapy.org/en/latest/topics/spider-middleware.html#scrapy.spidermiddlewares.offsite.OffsiteMiddleware

Filters out Requests for URLs outside the domains covered by the spider.

This middleware filters out every request whose host names aren't in the spider's allowed_domains attribute. All subdomains of any domain in the list are also allowed. E.g. the rule www.example.org will also allow bob.www.example.org but not www2.example.com nor example.com.

When your spider returns a request for a domain not belonging to those covered by the spider, this middleware will log a debug message similar to this one:

    DEBUG: Filtered offsite request to 'www.othersite.com': <GET http://www.othersite.com/some/page.html>

To avoid filling the log with too much noise, it will only print one of these messages for each new domain filtered. So, for example, if another request for www.othersite.com is filtered, no log message will be printed. But if a request for someothersite.com is filtered, a message will be printed (but only for the first request filtered).

If the spider doesn't define an allowed_domains attribute, or the attribute is empty, the offsite middleware will allow all requests.

If the request has the dont_filter attribute set, the offsite middleware will allow the request even if its domain is not listed in allowed domains.
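To make the subdomain rule from the documentation concrete, here is a minimal sketch in plain Python. It is not Scrapy's actual implementation; the function name is_onsite is made up for illustration, and it only assumes the behaviour described above (exact host or any subdomain matches, an empty list allows everything).

```python
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Hypothetical helper: True if the URL's host is an allowed domain or a subdomain of one."""
    if not allowed_domains:
        # An empty or missing list means nothing is filtered.
        return True
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


rule = ["www.example.org"]
print(is_onsite("http://bob.www.example.org/", rule))  # subdomain of the rule
print(is_onsite("http://www2.example.com/", rule))     # different domain
print(is_onsite("http://example.com/", rule))          # parent domain, not covered
```

With the rule www.example.org, the first call is accepted and the other two are rejected, which matches the example in the quoted documentation.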

My understanding is that URLs are normalized before they are filtered.
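As a rough illustration of why a relative link such as /meta still passes a domain check, the sketch below resolves links against the page they were found on, using only the standard library. It assumes that "normalized" here means "made absolute"; the helper name resolve_links and the sample hrefs are invented for the example.

```python
from urllib.parse import urljoin, urlparse


def resolve_links(page_url, hrefs):
    """Hypothetical helper: turn each href into an absolute URL relative to page_url."""
    return [urljoin(page_url, href) for href in hrefs]


page_url = "https://stackoverflow.com/questions"  # page the links were found on
hrefs = ["/meta", "tags", "https://www.othersite.com/some/page.html"]

for url in resolve_links(page_url, hrefs):
    # After resolution every URL carries a host, so a domain filter can be applied.
    print(url, "->", urlparse(url).hostname)
```

Once resolved, /meta becomes https://stackoverflow.com/meta, so its host can be compared against the allowed domains like any other link.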

+1

Does OffsiteMiddleware have to be enabled in settings.py?

+0

Just to be sure that it works.

+0

No, the entry 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500 is already part of the default settings; see https://doc.scrapy.org/en/latest/topics/settings.html?highlight=OffsiteMiddleware