Deny certain URLs from being crawled

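For some reason, a certain mobile URL is being crawled, and the resulting URL produces an error when it is crawled. I want scrapy to simply ignore the URL, without calling the parse method or anything else.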

class MySpider(scrapy.Spider):

    # name, allowed_domains etc
    rules = Rule(LxmlLinkExtractor(deny=r'/m/.+'))  # deny http://example.com/m/anything-here.html

But this doesn't work; such links are still being crawled.

Source

2014-12-11 yayu

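According to the docs:

deny (a regular expression (or list of)) - a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be excluded (i.e. not extracted).

And /m/.+ will not match an absolute URL such as http://example.com/m/anything-here.html. For the same reason you need the .+ at the end, you need at least a .* at the start: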

>>> print(re.match(r'/m/.+', 'http://example.com/m/anything-here.html')) 
None 
>>> print(re.match(r'.*/m/.+', 'http://example.com/m/anything-here.html')) 
<_sre.SRE_Match object; span=(0, 39), match='http://example.com/m/anything-here.html'>
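The same check can be put in a small script. This is only a sketch of how Python's `re` module behaves: `matches_from_start` is a hypothetical helper, not part of Scrapy, and it assumes the pattern is applied from the start of the URL as in the session above. `re.search` is shown alongside for contrast, since it looks for the pattern anywhere in the string; which of the two your Scrapy version uses internally is worth checking before relying on either result.

```python
import re

URL = 'http://example.com/m/anything-here.html'


def matches_from_start(pattern, url):
    """Hypothetical helper: True if `pattern` matches starting at url[0]."""
    return re.match(pattern, url) is not None


# re.match anchors at the beginning of the string, so the bare
# '/m/.+' pattern fails while the '.*'-prefixed one succeeds.
results = {
    r'/m/.+': matches_from_start(r'/m/.+', URL),
    r'.*/m/.+': matches_from_start(r'.*/m/.+', URL),
}

# re.search scans the whole string, so the bare pattern is found.
found_anywhere = re.search(r'/m/.+', URL) is not None

for pattern, matched in results.items():
    print(pattern, matched)
print('search:', found_anywhere)
```

If the bare pattern still lets the links through in your spider, it may also be worth confirming that the rule is being applied at all.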

Source

2014-12-11 01:42:09 abarnert

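If he wants to deny only the domain in question, a better expression would be 'http://example.com/m/.+', since he might still want others like 'http://test.com/m/something.html'. – bosnjak 2014-12-11 10:04:15

@Lawrence: Sure, but given the way he wrote the question, and the fact that he wrote it with '/m/.+', I'm pretty sure he wants to deny either (1) any URL containing '/m/', or (2) any URL whose path component starts with '/m/', rather than a specific domain. – abarnert 2014-12-13 01:59:04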
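To make the readings discussed in these comments concrete, here is a rough sketch comparing them with the standard library only. The third URL (`/a/m/nested.html`) and the helper `path_starts_with_m` are made up for illustration; the patterns are applied with `re.match`, which is an assumption about how they would be used.

```python
import re
from urllib.parse import urlparse

urls = [
    'http://example.com/m/anything-here.html',
    'http://test.com/m/something.html',
    'http://example.com/a/m/nested.html',  # hypothetical extra case
]

# Reading from the first comment: deny '/m/' only on one specific domain.
domain_only = re.compile(r'http://example\.com/m/.+')

# Reading (1): deny any URL that contains '/m/' followed by something.
any_m = re.compile(r'.*/m/.+')


def path_starts_with_m(url):
    """Reading (2): deny URLs whose path component starts with '/m/'."""
    return urlparse(url).path.startswith('/m/')


denied_by_domain = [u for u in urls if domain_only.match(u)]
denied_anywhere = [u for u in urls if any_m.match(u)]
denied_by_path = [u for u in urls if path_starts_with_m(u)]

print(denied_by_domain)
print(denied_anywhere)
print(denied_by_path)
```

The domain-specific pattern leaves test.com alone, the `.*/m/.+` pattern also catches `/m/` deeper in the path, and the path check sits in between.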
