Scrapy: crawling only internal links, including relative links

I need to use Scrapy to crawl all the internal links of a website, so that, for example, every link on www.stackoverflow.com gets crawled. This code sort of works:

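from scrapy.linkextractors import LinkExtractor

# Inside the spider callback: keep only links whose domain matches the start domain.
extractor = LinkExtractor(allow_domains=self.getBase(self.startDomain))

for link in extractor.extract_links(response):
    self.registerUrl(link.url)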

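However, there is one small problem: relative paths such as /meta are not crawled, because they do not contain the base domain stackoverflow.com. Any ideas on how to fix this?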

Source

2016-11-04 Lars Nielsen

Doesn't scrapy.spidermiddlewares.offsite.OffsiteMiddleware https://doc.scrapy.org/en/latest/topics/spider-middleware.html#scrapy.spidermiddlewares.offsite.OffsiteMiddleware already do this?

Thanks, I had apparently found some old documentation.

If I understand the question correctly, you want to use scrapy.spidermiddlewares.offsite.OffsiteMiddleware https://doc.scrapy.org/en/latest/topics/spider-middleware.html#scrapy.spidermiddlewares.offsite.OffsiteMiddleware

Filters out Requests for URLs outside the domains covered by the spider.

This middleware filters out every request whose host names aren't in the spider's allowed_domains attribute. All subdomains of any domain in the list are also allowed. E.g. the rule www.example.org will also allow bob.www.example.org but not www2.example.com nor example.com.

When your spider returns a request for a domain not belonging to those covered by the spider, this middleware will log a debug message similar to this one:

DEBUG: Filtered offsite request to 'www.othersite.com': <GET http://www.othersite.com/some/page.html>

To avoid filling the log with too much noise, it will only print one of these messages for each new domain filtered. So, for example, if another request for www.othersite.com is filtered, no log message will be printed. But if a request for someothersite.com is filtered, a message will be printed (but only for the first request filtered).

If the spider doesn't define an allowed_domains attribute, or the attribute is empty, the offsite middleware will allow all requests.

If the request has the dont_filter attribute set, the offsite middleware will allow the request even if its domain is not listed in allowed domains.
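The subdomain rule quoted above can be illustrated with a few lines of plain Python. This is only a sketch of the documented behaviour, not Scrapy's actual implementation; is_onsite is a hypothetical helper name introduced here for illustration.

```python
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Hypothetical helper mimicking the documented rule, not Scrapy's own code."""
    # An empty or missing allowed_domains list means every request is allowed.
    if not allowed_domains:
        return True
    host = (urlparse(url).hostname or "").lower()
    # A host is on-site if it equals an allowed domain or is a subdomain of one.
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


allowed = ["www.example.org"]
print(is_onsite("http://bob.www.example.org/page", allowed))  # True
print(is_onsite("http://www2.example.com/page", allowed))     # False
print(is_onsite("http://example.com/page", allowed))          # False
```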

My understanding is that URLs are normalized first and only then filtered.
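To make the point about relative links concrete: a relative href such as /meta becomes an absolute URL once it is joined with the URL of the page it was found on, and only then does a domain check make sense. The sketch below uses only the standard library; the page URL and hrefs are made-up sample values. As far as I know, Scrapy's LinkExtractor already returns absolute URLs from extract_links, so this step normally does not have to be written by hand.

```python
from urllib.parse import urljoin, urlparse

# Sample values, assumed for illustration only.
page_url = "https://stackoverflow.com/questions"
hrefs = ["/meta", "tagged/scrapy", "https://www.othersite.com/some/page.html"]

# Step 1: resolve every href against the page it appeared on.
absolute = [urljoin(page_url, href) for href in hrefs]

# Step 2: keep only URLs whose host is stackoverflow.com or a subdomain of it.
domain = "stackoverflow.com"
internal = [
    url for url in absolute
    if (urlparse(url).hostname or "") == domain
    or (urlparse(url).hostname or "").endswith("." + domain)
]

print(absolute[0])  # https://stackoverflow.com/meta
print(internal)
```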

Source

2016-11-04 13:47:44

Does OffsiteMiddleware have to be set in settings.py, or is it disabled by default?

Just to be sure that it works.

No, it is enabled by default as 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500, see https://doc.scrapy.org/en/latest/topics/settings.html?highlight=OffsiteMiddleware
