What is the purpose of setting follow to True in a CrawlSpider's LinkExtractor?

I saw this CrawlSpider example code in the documentation:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r'category\.php',), deny=(r'subsection\.php',))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

As I understand it, the following steps happen:

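The Scrapy spider above (MySpider) receives a response from the Scrapy Engine for the 'http://www.example.com' link (which is in the start_urls list). The LinkExtractors then extract all the links in that response according to the two rules given above.
Now let's assume the second LinkExtractor (the one with a callback) got 3 links ('http://www.example.com/item1.php', 'http://www.example.com/item2.php', 'http://www.example.com/item3.php'), and the first LinkExtractor, which has no callback, got 1 link (www.example.com/category1.php).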

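For the 3 links found above, the specified callback parse_item will simply be called. But what happens to that one link (www.example.com/category1.php), since there is no callback associated with it? Will the two LinkExtractors be run again on that one link? Is this assumption correct?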

2017-04-19 CapturedTree

# Extract links matching 'category.php' (but not matching 'subsection.php') 
# and follow links from them (since no callback means follow=True by default).

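Since your Rule object has no callback argument, its follow parameter defaults to True.
So in your example, that 1 link will be crawled and links will be extracted from its response, just as was done for the first page. This continues until the first rule extracts no more links or all of the links have been visited.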
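
To make the traversal above concrete, here is a rough pure-Python model of it. It is only a sketch and not Scrapy's actual code: the in-memory SITE dictionary, the RULES tuples and the crawl() and extract() helpers are all made up for illustration. The toy URLs use query strings (category.php?id=1, item.php?id=1) so that they really match the category\.php and item\.php patterns, and the breadth-first ordering is a simplification.

```python
import re
from collections import deque

# Hypothetical in-memory site: URL -> links found on that page.
SITE = {
    "http://www.example.com": [
        "http://www.example.com/category.php?id=1",
        "http://www.example.com/item.php?id=1",
        "http://www.example.com/item.php?id=2",
        "http://www.example.com/item.php?id=3",
    ],
    "http://www.example.com/category.php?id=1": [
        "http://www.example.com/item.php?id=3",  # duplicate, filtered out
        "http://www.example.com/item.php?id=4",
        "http://www.example.com/subsection.php?from=category.php",  # denied
    ],
}

# (allow pattern, deny pattern or None, callback name or None)
RULES = [
    (r"category\.php", r"subsection\.php", None),
    (r"item\.php", None, "parse_item"),
]


def extract(links, allow, deny):
    """Keep links that match `allow` and do not match `deny`."""
    return [
        link for link in links
        if re.search(allow, link) and not (deny and re.search(deny, link))
    ]


def crawl(start_url):
    """Return (fetched, parsed): every page downloaded, and those given to a callback."""
    seen = {start_url}
    # The start page has no user callback here and is always followed.
    queue = deque([(start_url, None, True)])
    fetched, parsed = [], []
    while queue:
        url, callback, follow = queue.popleft()
        fetched.append(url)
        if callback:
            parsed.append(url)  # stands in for calling parse_item
        if not follow:
            continue  # page is parsed only, its links are ignored
        for allow, deny, cb in RULES:  # every rule runs on a followed page
            for link in extract(SITE.get(url, []), allow, deny):
                if link not in seen:
                    seen.add(link)
                    # Modelled default: no callback means follow=True.
                    queue.append((link, cb, cb is None))
    return fetched, parsed


fetched, parsed = crawl("http://www.example.com")
print("fetched:", fetched)
print("parsed: ", parsed)
```

In this model the category page is downloaded and both rules run on its links again, but it never reaches parse_item; the item pages are parsed and their links are not followed.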

2017-04-19 06:39:49 Granitosaurus

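Oh okay, I see now. So basically the two 'LinkExtractors' will once again extract the matching links from the response produced by that link? And when you set 'follow=True', is there still any point in having a callback? – CapturedTree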

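No, it's not necessary to provide a callback in order to follow links, since you don't want to parse them manually. Think of it this way: 'follow=True' means that it will call back to a _hidden_ callback function that just runs all of the rules on the response and does nothing else. – Granitosaurus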

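You state 'So in your example, 1 link will be crawled and links will be extracted from it'. When you say one link will be crawled, do you basically mean that it will be scraped for the proper links based on the LinkExtractors? – CapturedTree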
