The website I'm scraping lists multiple products with the same ID but different prices. I want to keep only the lowest-priced version of each product. How can I keep just the lowest-priced product in Scrapy?
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.ids_seen = dict()  # maps product ID -> lowest sale price seen so far

    def process_item(self, item, spider):
        if item['ID'] in self.ids_seen:
            if item['sale_price'] > self.ids_seen[item['ID']]:
                raise DropItem("Duplicate item found: %s" % item)
        else:
            # dict has no .add(); record the price for this ID
            self.ids_seen[item['ID']] = item['sale_price']
        return item
So this code should drop an item whose price is higher than a previously seen one, but I can't figure out how to update the previously scraped item when a lower price turns up.
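One common approach (a sketch, not code from the question): since an item that has already been passed downstream cannot be un-exported, buffer every item in the pipeline and only write out the cheapest one per ID when the spider closes. The field names ('ID', 'sale_price') and the `LowestPricePipeline` class name follow the question's code but are otherwise assumptions; the output file name is hypothetical.

```python
# Sketch of a "keep the lowest price" pipeline: hold everything back in
# process_item, then export the winners in close_spider.
try:
    from scrapy.exceptions import DropItem
except ImportError:          # fallback so the sketch runs without Scrapy installed
    class DropItem(Exception):
        pass

import json

class LowestPricePipeline(object):
    def __init__(self):
        self.lowest = {}  # product ID -> cheapest item seen so far

    def process_item(self, item, spider):
        current = self.lowest.get(item['ID'])
        if current is None or item['sale_price'] < current['sale_price']:
            self.lowest[item['ID']] = item
        # Nothing is exported per item; every item is buffered until the end.
        raise DropItem("buffering %s" % item['ID'])

    def close_spider(self, spider):
        # Write the surviving items out manually, e.g. as JSON lines.
        with open('lowest_prices.jl', 'w') as out:
            for item in self.lowest.values():
                out.write(json.dumps(item) + '\n')
```

Because every item is dropped in `process_item`, Scrapy's normal feed export writes nothing; the pipeline itself writes the deduplicated output in `close_spider`. An alternative is to post-process the exported file after the crawl finishes, which keeps the pipeline simple.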
# -*- coding: utf-8 -*-
import scrapy
import urlparse
import re

class ExampleSpider(scrapy.Spider):
    name = 'name'
    allowed_domains = ['domain1', 'domain2']
    start_urls = ['url1', 'url2']

    def parse(self, response):
        for href in response.css('div.catalog__main__content .c-product-card__name::attr("href")').extract():
            url = urlparse.urljoin(response.url, href)
            yield scrapy.Request(url=url, callback=self.parse_product)
        # follow pagination links
        href = response.css('.c-paging__next-link::attr("href")').extract_first()
        if href is not None:
            url = urlparse.urljoin(response.url, href)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse_product(self, response):
        # process the response here (omitted because it's long and doesn't add anything)
        yield {
            'product-name': name,
            'price-sale': price_sale,
            'price-regular': price_regular[:-1],
            'raw-sku': raw_sku,
            'sku': sku.replace('_', '/'),
            'img': response.xpath('//img[@class="itm-img"]/@src').extract()[-1],
            'description': response.xpath('//div[@class="product-description__block"]/text()').extract_first(),
            'url': response.url,
        }
What website are you scraping? What does the spider's code look like? – Umair
@Umair I can't tell you the website, but I've included the spider code. Not sure it's relevant to this question, but here it is. –