How do I keep only the lowest-priced product in Scrapy?

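The site I'm scraping has multiple products that share the same ID but have different prices. I want to keep only the lowest-priced version. How do I keep only the lowest-priced product in Scrapy?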

from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):

    def __init__(self):
        # Maps each product ID to the lowest sale price seen so far.
        self.ids_seen = dict()

    def process_item(self, item, spider):
        seen_price = self.ids_seen.get(item['ID'])
        if seen_price is not None and item['sale_price'] > seen_price:
            raise DropItem("Duplicate item found: %s" % item)
        # New ID, or a price no higher than the stored one: record it.
        self.ids_seen[item['ID']] = item['sale_price']
        return item

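So this code should drop any item whose price is higher than one already seen, but I can't figure out how to update the previously scraped item when the new price is lower.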

# -*- coding: utf-8 -*-
import scrapy
import re


class ExampleSpider(scrapy.Spider):
    name = 'name'
    allowed_domains = ['domain1', 'domain2']
    start_urls = ['url1', 'url2']

    def parse(self, response):
        # follow links to the individual product pages
        for href in response.css('div.catalog__main__content .c-product-card__name::attr("href")').extract():
            url = response.urljoin(href)
            yield scrapy.Request(url=url, callback=self.parse_product)

        # follow pagination links
        href = response.css('.c-paging__next-link::attr("href")').extract_first()
        if href is not None:
            url = response.urljoin(href)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse_product(self, response):
        # process the response here (omitted because it's long and doesn't add anything)
        yield {
            'product-name': name,
            'price-sale': price_sale,
            'price-regular': price_regular[:-1],
            'raw-sku': raw_sku,
            'sku': sku.replace('_', '/'),
            'img': response.xpath('//img[@class="itm-img"]/@src').extract()[-1],
            'description': response.xpath('//div[@class="product-description__block"]/text()').extract_first(),
            'url': response.url,
        }


2017-07-02 Taha Attari

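Which site are you scraping? What does the spider code look like? – Umair

@Umair I can't tell you the site, but I've included the spider code. I'm not sure it's relevant to this question, but here it is. –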

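Do you know the product IDs before you start? If so, most sites let you sort search results by price, low to high, so you can scrape just the first item returned for each product ID. That avoids the need for any pipeline processing.

If you don't know them, you can do it in two steps: first crawl all products to collect the IDs, then run the process above for each ID.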
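
A minimal sketch of the "take the first hit" idea, assuming the items already arrive in ascending price order (for example, from a listing sorted low to high). The function name `first_item_per_id` and the `ID` and `price` keys are illustrative and not part of the original code. Keep in mind that Scrapy downloads pages concurrently, so items from different pages may not arrive in listing order.

```python
def first_item_per_id(items, id_key="ID"):
    """Yield only the first item seen for each ID.

    If `items` is sorted by ascending price, the first item per ID
    is also the cheapest one.
    """
    seen = set()
    for item in items:
        if item[id_key] in seen:
            continue  # a cheaper (earlier) item with this ID was already kept
        seen.add(item[id_key])
        yield item


# Example listing, already sorted by price from low to high.
listing = [
    {"ID": "a", "price": 5.0},
    {"ID": "b", "price": 7.5},
    {"ID": "a", "price": 9.0},
]
cheapest = list(first_item_per_id(listing))
```

Here `cheapest` keeps the 5.0 entry for `a` and the only entry for `b`; the 9.0 duplicate is skipped.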


2017-07-03 15:47:04 user2525823

I don't know the IDs, but starting from the page sorted by price, low to high, is a great idea! Thanks –

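You can't do this with a pipeline alone, because a pipeline works on items in flight. In other words, it returns each item as it arrives, without waiting for the spider to finish.

However, if you have a database, you can work around this:

In semi-pseudocode: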

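class DbPipeline(object):

    def __init__(self):
        # Placeholder: connect to your database here. get/add/remove below
        # stand in for whatever API your database client provides.
        self.connection = connect_to_database()

    def process_item(self, item, spider):
        db_item = self.connection.get(item['ID'])
        # Store the item if it is new or cheaper than the stored one.
        if db_item is None or item['price'] < db_item['price']:
            if db_item is not None:
                self.connection.remove(item['ID'])
            self.connection.add(item)
        return item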

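Scrapy's own output will still contain the unfiltered results, but your database will be in order, with only the lowest-priced item stored for each ID.
Personally, I'd recommend a document-based or key-value database such as redis.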
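
For a version of the pseudocode above that actually runs, here is a rough sketch that swaps in `sqlite3` from the Python standard library in place of redis, so it needs no external service. The class name `LowestPricePipeline`, the `products` table and its columns, and the `lowest_prices` helper are all assumptions made for illustration; it also assumes every item has an `ID` and a numeric `price`.

```python
import sqlite3


class LowestPricePipeline(object):
    """Keep one row per product ID, holding the lowest price seen."""

    def __init__(self, db_path=":memory:"):
        # db_path is a placeholder; point it at a file to persist results.
        self.connection = sqlite3.connect(db_path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS products ("
            "id TEXT PRIMARY KEY, price REAL NOT NULL)"
        )

    def process_item(self, item, spider):
        row = self.connection.execute(
            "SELECT price FROM products WHERE id = ?", (item["ID"],)
        ).fetchone()
        # Insert new IDs; overwrite only when the new price is lower.
        if row is None or item["price"] < row[0]:
            self.connection.execute(
                "INSERT OR REPLACE INTO products (id, price) VALUES (?, ?)",
                (item["ID"], item["price"]),
            )
            self.connection.commit()
        return item

    def lowest_prices(self):
        """Return {id: price} for everything stored so far."""
        rows = self.connection.execute("SELECT id, price FROM products")
        return dict(rows.fetchall())
```

As in the pseudocode, every item is still returned to Scrapy unchanged; only the database ends up deduplicated.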


2017-07-02 11:54:59 Granitosaurus

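Is there no way to do this with a custom item exporter? I'd like to run it on scrapinghub, and I've already set up a system that uses their API. –

@TahaAttari You probably can, but that means keeping all the data in a buffer and only then writing it to a file, which is a bad idea unless the crawl is very small. – Granitosaurus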
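
To make the buffering trade-off concrete, here is a rough sketch of the buffered approach, written as a pipeline rather than an item exporter. It holds the cheapest item per ID in memory and writes the result only when Scrapy calls `close_spider`. The class name, the output path, and the `ID` and `price` keys are assumptions for illustration; memory use grows with the number of distinct IDs, which is why this only suits small crawls.

```python
import json


class BufferedLowestPricePipeline(object):
    """Buffer the cheapest item per ID and write them out at the end."""

    def __init__(self, path="lowest_prices.jl"):
        # The output path is a placeholder chosen for this sketch.
        self.path = path
        self.cheapest = {}

    def process_item(self, item, spider):
        current = self.cheapest.get(item["ID"])
        # Keep the item if its ID is new or its price beats the buffered one.
        if current is None or item["price"] < current["price"]:
            self.cheapest[item["ID"]] = dict(item)
        return item

    def close_spider(self, spider):
        # Called once the spider finishes; write one JSON line per ID.
        with open(self.path, "w") as handle:
            for item in self.cheapest.values():
                handle.write(json.dumps(item) + "\n")
```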
