Scrapy pipeline to check for duplicates

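I'm passing a timestamp to DynamoDB using a plugin I downloaded. The spider runs from cron every two minutes. Previously it took the timestamp from the site via XPath, so it was unique; but now every run generates a new timestamp, so every run creates a new entry. Could you point me to a pipeline solution that checks whether the same url already exists, so the spider skips it?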

My spider:

def parse(self, response):
    for item in response.xpath("//li[contains(@class, 'river-block')]"):
        url = item.xpath(".//h2/a/@href").extract()[0]
        stamp = Timestamp().timestamp
        yield scrapy.Request(url, callback=self.get_details, meta={'stamp': stamp})

def get_details(self, response):
    article = ArticleItem()
    article['title'] = response.xpath("//header/h1/text()").extract_first()
    article['url'] = format(shortener.short(response.url))
    article['stamp'] = response.meta['stamp']
    yield article

My pipeline:

class DynamoDBStorePipeline(object):

    def process_item(self, item, spider):
        dynamodb = boto3.resource('dynamodb', region_name="us-west-2")

        table = dynamodb.Table('x')

        table.put_item(
            Item={
                'url': str(item['url']),
                'title': item['title'].encode('utf-8'),
                'stamp': item['stamp'],
            }
        )
        return item


2017-06-02 yurashark

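By default, Scrapy does not execute the same request more than once within a single run.

For more information, you can read about dont_filter, which defaults to False; setting it to True makes a request ignore the duplicate filter.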

Alternatively, you can create an array and check whether the URL is already in it. I think it is better to do the duplicate check here than in the pipeline, because for a duplicate you then skip the work you don't need to do.

# yourArray is a list of URLs already seen, created once on the spider
url = response.url
if url not in yourArray:
    article = ArticleItem()
    article['title'] = response.xpath("//header/h1/text()").extract_first()
    article['url'] = url
    article['stamp'] = response.meta['stamp']
    yourArray.append(url)
    yield article
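
An in-memory array only lasts for one process, and the spider in the question is started by cron every two minutes, so each run would begin with an empty array. Below is a minimal sketch of one way around that, assuming it is acceptable to keep the seen URLs in a local JSON file. The SeenUrls class and the seen_urls.json file name are hypothetical and not part of Scrapy.

```python
import json
import os


class SeenUrls:
    """Set of URLs that survives between runs by being saved to a JSON file."""

    def __init__(self, path="seen_urls.json"):  # hypothetical file name
        self.path = path
        self.urls = set()
        # Load URLs remembered by earlier runs, if the file already exists.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.urls = set(json.load(f))

    def add_if_new(self, url):
        """Return True and remember the URL if it has not been seen before."""
        if url in self.urls:
            return False
        self.urls.add(url)
        return True

    def save(self):
        """Write the URLs back to disk; call this when the spider finishes."""
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(sorted(self.urls), f)
```

In the spider, one option would be to create the object in __init__, yield the article only when add_if_new(response.url) returns True, and call save() from the spider's closed() method.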


2017-06-02 14:47:37 parik

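Will this check the items in my DynamoDB? – yurashark

The code I wrote gives you items with unique URLs, meaning you won't have two items with the same URL. – parik

The URLs are unique. The timestamps are not, because they are generated every time cron runs. I tried 'attribute_not_exists', but that didn't help me. I think I need 'exists()', but I don't know how to implement it. I'm quite new to Python – yurashark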

After digging through Stack Overflow questions and the Boto3 documentation, I was able to come up with a solution:

class DynamoDBStorePipeline(object):

    def process_item(self, item, spider):
        dynamodb = boto3.resource('dynamodb', region_name="us-west-2")

        table = dynamodb.Table('x')

        table.put_item(
            Item={
                'link': str(item['link']),
                'title': item['title'].encode('utf-8'),
                'stamp': item['stamp'],
            },
            ConditionExpression='attribute_not_exists(link) AND attribute_not_exists(title)',
        )
        return item
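
When the condition fails, put_item does not skip the write silently; it raises an error whose code is ConditionalCheckFailedException. Below is a minimal sketch of handling that, assuming link is the table's partition key (a condition is only evaluated against the existing item with the same key). The store_if_new helper is hypothetical. It inspects the exception's response attribute, which botocore's ClientError carries, so the sketch itself needs no AWS imports.

```python
def store_if_new(table, item):
    """Write item unless one with the same 'link' key already exists.

    `table` is expected to behave like a boto3 DynamoDB Table resource.
    Returns True if the item was written, False if it was a duplicate.
    """
    try:
        table.put_item(
            Item={
                "link": str(item["link"]),
                "title": item["title"],
                "stamp": item["stamp"],
            },
            # "#k" is a placeholder, so the expression works even if the
            # attribute name collides with a DynamoDB reserved word.
            ConditionExpression="attribute_not_exists(#k)",
            ExpressionAttributeNames={"#k": "link"},
        )
    except Exception as exc:
        # botocore's ClientError exposes the service error code as
        # exc.response["Error"]["Code"]; any other error is re-raised.
        error = getattr(exc, "response", None) or {}
        if error.get("Error", {}).get("Code") != "ConditionalCheckFailedException":
            raise
        return False
    return True
```

In process_item you could call store_if_new(table, item) and raise scrapy.exceptions.DropItem when it returns False, so duplicates are dropped rather than logged as pipeline errors. In real code, catching botocore.exceptions.ClientError directly would be more precise than catching Exception.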


2017-06-02 18:59:09 yurashark
