Handling a redirected response with the correct parser

I am crawling a website with Scrapy. The parse method first extracts all the category links and then issues a request for each one with parse_category as the callback.

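The problem is that when a category contains only one product, the site redirects the category URL to that product's page. My parse_category cannot make sense of that page.

How can I have the redirected category page parsed by the product-page parser instead?

Here is an example.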

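parse finds 3 category pages.
1. http://example.com/products/samsung
2. http://example.com/products/dell
3. http://example.com/products/apple
parse_category is called for all of these pages. Each one returns an HTML page with a product list. But apple has a single product, the iMac 27", so it redirects to http://example.com/products/apple/imac_27, which is a product page. The category parser cannot parse it.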

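I need the product parsing method, parse_product, to be called in this scenario. How do I do that?

I could add some logic to my parse_category method and call parse_product from there, but I don't want that. I want Scrapy to do it, and I am happy to supply a URL pattern or any other information it needs.

Here is the code.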

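# Scrapy 0.x / Python 2 imports, matching the API used below
from urlparse import urljoin

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.utils.response import get_base_url


class ExampleSpider(BaseSpider):
    name = u'example.com'
    allowed_domains = [u'www.example.com']
    start_urls = [u'http://www.example.com/category.aspx']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        # '/xpath' is a placeholder for the category-link XPath;
        # extract() turns the selectors into plain strings for urljoin
        anchors = hxs.select('/xpath').extract()
        for anchor in anchors:
            yield Request(urljoin(get_base_url(response), anchor), callback=self.parse_category)

    def parse_category(self, response):
        hxs = HtmlXPathSelector(response)

        # products_xpath is a placeholder for the product-link XPath
        products = hxs.select(products_xpath).extract()
        for url in products:
            yield Request(url, callback=self.parse_product)

    def parse_product(self, response):
        # product parsing ...
        pass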


2013-07-12 Genghis Khan

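@alecxe I don't think the code is necessary here. I have described my problem well enough, and the problem is not in the code. Still, here is my simplified spider.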

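You could write a downloader middleware that implements the process_response method. Whenever the response comes from a product URL rather than a category URL, create a copy of the Request object and change its callback function to your product parser.

Finally, return the new Request object from the middleware. Note: you may need to set dont_filter to True on the new Request so that the DupeFilter does not filter it out.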
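
The approach above might be sketched roughly as follows. This is only an illustration under stated assumptions: the class name ProductRedirectMiddleware and the PRODUCT_URL pattern are made up for this example, and the pattern is guessed from the sample URLs in the question (/products/<category>/<product>). It also assumes the spider exposes a parse_product method, as in the question.

```python
import re

# Assumption: a product page has two path segments after /products/,
# as in http://example.com/products/apple/imac_27 from the question.
PRODUCT_URL = re.compile(r"/products/[^/?#]+/[^/?#]+/?$")


class ProductRedirectMiddleware:
    """Hypothetical downloader middleware: send product pages to parse_product."""

    def process_response(self, request, response, spider):
        product_callback = getattr(spider, "parse_product", None)

        # Nothing to do if the spider has no product parser, or if this
        # request is already routed to it (this guard also prevents a loop).
        if product_callback is None or request.callback == product_callback:
            return response

        # Category pages and anything else pass through unchanged.
        if not PRODUCT_URL.search(response.url):
            return response

        # Copy the request, swap the callback, and bypass the dupe filter,
        # since this URL has already been seen once.
        return request.replace(callback=product_callback, dont_filter=True)
```

To try it, the class would be listed under DOWNLOADER_MIDDLEWARES in the project's settings.py, using whatever module path your project has. Note that when process_response returns a Request, Scrapy schedules it for download again, so the product page is fetched a second time before parse_product sees it.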


2013-07-12 11:32:12
