2015-04-01

Good evening everyone. Avoiding scraping data that has already been scraped

I'm still working on my spider that scrapes data from a news site, but I've run into another problem. My original question was posted here: Scrapy outputs [ into my .json file, but that one has been resolved.

I've managed to get a bit further, having made allowance for empty items and added a keyword search, and I'm now trying to scrape only the articles I haven't scraped yet (bearing in mind that I still want to extract links from them to follow). I can't figure out where in the code to put:

a.) the definition of when the last crawl happened, and b.) the comparison between an article's date and the date of that last crawl.

I may just be struggling with the logic, so I'm turning to you.
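To make (a) concrete, here is a minimal sketch of one way the timestamp of the previous run could be persisted outside the spider, since a plain variable does not survive between runs. The file name, format string, and helper names below are made up for illustration only:

# rough sketch of (a): persist the time of the last crawl to a small file
# so that the next run can read it back. "last_crawled.txt" and the helper
# names are examples, not part of the original spider.

import os
from datetime import datetime

STAMP_FILE = "last_crawled.txt"
STAMP_FORMAT = "%Y-%m-%d %H:%M"

def load_last_crawled():
    # returns the datetime of the previous run, or None on the very first run
    if not os.path.exists(STAMP_FILE):
        return None
    with open(STAMP_FILE) as f:
        return datetime.strptime(f.read().strip(), STAMP_FORMAT)

def save_last_crawled():
    # called once when the crawl finishes (e.g. from the spider_closed signal)
    with open(STAMP_FILE, "w") as f:
        f.write(datetime.now().strftime(STAMP_FORMAT))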

My spider:

# tabbing in python is apparently VERY important so be aware and make sure
# things that should line up do so

# import the CrawlSpider class, along with its Rules (this lets us recursively
# crawl pages)

from scrapy.contrib.spiders import CrawlSpider, Rule

# import the link extractor, this extracts links from pages

from scrapy.contrib.linkextractors import LinkExtractor

# import our items as defined in items.py

from basic.items import BasicItem

# import time so that we can get the current date and time

import time

# import re which allows us to compare strings

import re

# create a new spider with the CrawlSpider class

class BasicSpiderSpider(CrawlSpider):

    # name of the spider, this is used to run it (i.e. scrapy crawl basic_spider)

    name = "basic_spider"

    # domains that the spider is allowed to crawl over

    allowed_domains = ["news24.com"]

    # where to start crawling from

    start_urls = [
        'http://www.news24.com',
    ]

    # rules for the link extractor (i.e. where it's allowed to look for links,
    # what to do once it's found them, and whether it's allowed to follow them)

    rules = (
        Rule(LinkExtractor(), callback="parse_items", follow=True),
    )

    # defining the callback function

    def parse_items(self, response):

        # defines the top-level XPath where all of our information can be found,
        # needs to be as specific as possible to avoid duplicates

        for title in response.xpath('//*[@id="aspnetForm"]'):

            # list of keywords to search through

            key = re.compile("joburg|durban", re.IGNORECASE)

            # extracting the data to compare with the keywords, this is for the
            # headlines, the join converts it from a list type to a string type

            headlist = title.xpath('//*[@id="article_special"]//h1/text()').extract()
            head = ''.join(headlist)

            # and this is for the article

            artlist = title.xpath('//*[@id="article-body"]//text()').extract()
            art = ''.join(artlist)

            # if any keywords are found in the headline:

            if key.search(head):
                if last_crawled > response.xpath('//*[@id="spnDate"]/text()').extract():

                    # define the top-level XPath again as python won't look outside
                    # its current function

                    for thing in response.xpath('//*[@id="aspnetForm"]'):

                        # fills the items defined in items.py with relevant data

                        item = BasicItem()
                        item['Headline'] = thing.xpath('//*[@id="article_special"]//h1/text()').extract()
                        item["Article"] = thing.xpath('//*[@id="article-body"]/p[1]/text()').extract()
                        item["Date"] = thing.xpath('//*[@id="spnDate"]/text()').extract()
                        item["Link"] = response.url

                        # I found that even with being careful about my XPaths I
                        # still got empty fields and lines, the below fixes that

                        if item['Headline']:
                            if item["Article"]:
                                if item["Date"]:
                                    last_crawled = (time.strftime("%Y-%m-%d %H:%M"))
                                    yield item

            # if the headline item doesn't match, check the article item

            elif key.search(art):
                if last_crawled > response.xpath('//*[@id="spnDate"]/text()').extract():
                    for thing in response.xpath('//*[@id="aspnetForm"]'):
                        item = BasicItem()
                        item['Headline'] = thing.xpath('//*[@id="article_special"]//h1/text()').extract()
                        item["Article"] = thing.xpath('//*[@id="article-body"]/p[1]/text()').extract()
                        item["Date"] = thing.xpath('//*[@id="spnDate"]/text()').extract()
                        item["Link"] = response.url

                        if item['Headline']:
                            if item["Article"]:
                                if item["Date"]:
                                    last_crawled = (time.strftime("%Y-%m-%d %H:%M"))
                                    yield item

It isn't working, but as I said, I'm doubtful about the logic anyway. Can anyone let me know whether I'm on the right track here?

Thanks again for all the help.
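As a sketch of what (b) would need to look like: the comparison in the spider above matches a string against the list that .extract() returns, so for the date check to mean anything the text inside span#spnDate would first have to be parsed into a datetime. Something along these lines, where the "%Y-%m-%d %H:%M" format is only a guess at what news24 actually prints, and last_crawled is assumed to already be a datetime (or None on the first run):

# rough sketch of (b): parse the article's date before comparing it.
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M"   # assumption about the span#spnDate format

def is_new_article(response, last_crawled):
    # True if the article's date is newer than the previous crawl
    date_text = response.xpath('//*[@id="spnDate"]/text()').extract()
    if not date_text:
        return False
    article_date = datetime.strptime(date_text[0].strip(), DATE_FORMAT)
    return last_crawled is None or article_date > last_crawled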

Answer


You seem to be using last_crawled completely out of context. But don't bother trying to make it work; you would be much better off using the deltafetch middleware, which was created for exactly what you are trying to do:

This is a spider middleware to ignore requests to pages containing items seen in previous crawls of the same spider, thus producing a "delta crawl" containing only new items.

To use deltafetch, install scrapylib first:

pip install scrapylib 

and after that, enable it in settings.py:

SPIDER_MIDDLEWARES = { 
    'scrapylib.deltafetch.DeltaFetch': 100, 
} 

DELTAFETCH_ENABLED = True 
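For context on why this fits the problem: the middleware keeps a small on-disk store (under the project's .scrapy directory by default) of the requests whose responses actually produced items, and on later runs it silently drops those same requests. A paraphrased sketch of the idea follows; this is written from memory to show the mechanism, not copied from the scrapylib source:

# conceptual sketch of the deltafetch idea: "db" stands for the on-disk
# store the middleware keeps between runs.

import time
from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

def filter_seen(result, response, db):
    """Yield only new requests; record requests whose responses gave items."""
    for entry in result:
        if isinstance(entry, Request):
            key = entry.meta.get('deltafetch_key') or request_fingerprint(entry)
            if key in db:
                # this page already produced items in an earlier run, skip it
                continue
            yield entry
        else:
            # an item was scraped: remember the request that led to it
            key = response.request.meta.get('deltafetch_key') or request_fingerprint(response.request)
            db[key] = str(time.time())
            yield entry

A practical consequence worth noting: pages that only yield links and no items are never recorded, so they keep being fetched and their links keep being followed; only the article pages that have already produced items are skipped on later runs.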

Thanks Lawrence. The references I came across while trying to find an answer about deltafetch were along these lines, but it didn't seem to fit my needs: I want to visit a page and extract the links from it to follow (in case there are new related articles or other similar ones). Will deltafetch still extract the links to follow from a page even when it doesn't want to extract item data from it? Will have a play regardless and let you know. Thanks for the reply! – 2015-04-02 06:21:39


Gave it my best shot and it works perfectly, thank you! – 2015-04-02 13:46:24