Added a Scrapy rule, but no additional items are scraped

Some items were missing from my Scrapy output file, so I manually added the pages that were missed as a third rule.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class KjvSpider(CrawlSpider):
    name = 'kjv'
    start_urls = ['file:///G:/OEBPS2/bible-toc.xhtml']

    rules = (
        Rule(LinkExtractor(allow=r'OEBPS'), follow=True),  # 1st rule

        Rule(LinkExtractor(allow=r'\d\.xhtml$'),
             callback='parse_item', follow=False),  # 2nd rule
        Rule(LinkExtractor(allow=[r'2-jn.xhtml$', r'jude.xhtml$', r'obad.xhtml$', r'philem.xhtml$']),
             callback='parse_item', follow=False),  # 3rd rule
    )

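If I enable only the 1st and 3rd rules (with the 2nd rule commented out), the four missing items are downloaded correctly, but not the rest of the items (about 2000 items).

However, if I enable all three rules, the missing items are still missing. (That is, adding the 3rd rule makes no difference.)

I don't know why the rule doesn't work.

Any suggestions would be welcome. Thanks in advance.

Source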

2017-04-09 Aaron

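I found that I have to deny these missing URLs in the 1st rule so that, when they reach the 3rd rule, they are not filtered out as duplicate requests. With that change they are extracted correctly.

For example: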

rules = (
    Rule(LinkExtractor(allow=r'OEBPS',
                       deny=(r'2-jn.xhtml$', r'jude.xhtml$',
                             r'obad.xhtml$', r'philem.xhtml$')),
         follow=True),  # 1st rule

    Rule(LinkExtractor(allow=r'\d\.xhtml$'),
         callback='parse_item', follow=False),  # 2nd rule
    Rule(LinkExtractor(allow=[r'2-jn.xhtml$', r'jude.xhtml$', r'obad.xhtml$', r'philem.xhtml$']),
         callback='parse_item', follow=False),  # 3rd rule
)
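
To see which rules' patterns each of the four URLs matches before and after adding `deny`, the patterns can be checked outside Scrapy. This is only a rough sketch: it approximates `LinkExtractor`'s allow/deny filtering with plain `re.search` (deny taking precedence), and it does not run the spider or its duplicate filtering. The helper names `matches` and `matching_rules` are made up for illustration, and the URLs assume the four pages sit in the same directory as the `start_urls` page.

```python
import re

# Patterns copied from the rules in this post.
RULE1_ALLOW = [r'OEBPS']
RULE1_DENY = [r'2-jn.xhtml$', r'jude.xhtml$', r'obad.xhtml$', r'philem.xhtml$']
RULE2_ALLOW = [r'\d\.xhtml$']
RULE3_ALLOW = [r'2-jn.xhtml$', r'jude.xhtml$', r'obad.xhtml$', r'philem.xhtml$']


def matches(url, allow, deny=()):
    """Hypothetical helper: True if no deny pattern and at least one allow pattern matches."""
    if any(re.search(p, url) for p in deny):
        return False
    return any(re.search(p, url) for p in allow)


def matching_rules(url, use_deny):
    """Return the numbers of the rules whose patterns accept the URL."""
    deny = RULE1_DENY if use_deny else ()
    hits = []
    if matches(url, RULE1_ALLOW, deny):
        hits.append(1)
    if matches(url, RULE2_ALLOW):
        hits.append(2)
    if matches(url, RULE3_ALLOW):
        hits.append(3)
    return hits


# Assumed location of the four pages (same directory as bible-toc.xhtml).
BASE = 'file:///G:/OEBPS2/'
for page in ('2-jn.xhtml', 'jude.xhtml', 'obad.xhtml', 'philem.xhtml'):
    url = BASE + page
    print(page, matching_rules(url, use_deny=False), matching_rules(url, use_deny=True))
```

Without `deny`, each of the four URLs is accepted by both the 1st and the 3rd rule; with `deny` in the 1st rule, only the 3rd rule (the one with the `parse_item` callback) accepts them.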

Source

2017-04-09 09:46:25 Aaron
