2015-04-01 78 views


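I'm still working on my spider that scrapes data from a news site, but I've run into another problem. My original question was posted here: Scrapy outputs [ into my .json file, but that one has been solved.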


a.) Define when the last crawl took place, and b.) compare the article's date with the date of the last crawl.



# tabbing in python is apparently VERY important so be aware and make sure 
# things that should line up do so 

# import the CrawlSpider class, along with its Rules (this lets us recursively
# crawl pages)

from scrapy.contrib.spiders import CrawlSpider, Rule 

#import the link extractor, this extracts links from pages 

from scrapy.contrib.linkextractors import LinkExtractor 

# import our items as defined in items.py 

from basic.items import BasicItem 

# import time so that we can get the current date and time

import time 

# import re which allows us to compare strings 

import re 

# create a new Spider with the CrawlSpider Class 

class BasicSpiderSpider(CrawlSpider): 

    # Name of the spider, this is used to run it (i.e. scrapy crawl basic_spider)

    name = "basic_spider" 

    # domains that the spider is allowed to crawl over 

    allowed_domains = ["news24.com"] 

    # where to start crawling from 

    start_urls = []  # the start URL(s) are missing from the source

    # Rules for the link extractor (i.e. where it's allowed to look for links,
    # what to do once it's found them, and whether it's allowed to follow them)

    rules = (Rule(LinkExtractor(), callback="parse_items", follow=True),)

    # defining the callback function 

    def parse_items(self, response): 

     # defines the Top level XPath where all of our information can be found, needs to be 
     # as specific as possible to avoid duplicates 

     for title in response.xpath('//*[@id="aspnetForm"]'): 

      # List of keywords to search through. 

      key = re.compile("joburg|durban", re.IGNORECASE) 

      # extracting the data to compare with the keywords, this is for the 
      # headlines, the join converts it from a list type to a string type 

      headlist = title.xpath('//*[@id="article_special"]//h1/text()').extract() 
      head = ''.join(headlist) 

      # and this is for the article. 

      artlist = title.xpath('//*[@id="article-body"]//text()').extract() 
      art = ''.join(artlist) 

      # if any keywords are found in the headline: 

      if key.search(head): 
       if last_crawled > response.xpath('//*[@id="spnDate"]/text()').extract():
        # define the top level xpath again as python won't look outside
        # its current function

        for thing in response.xpath('//*[@id="aspnetForm"]'): 

         # fills the items defined in items.py with relevant data 

         item = BasicItem() 
         item['Headline'] = thing.xpath('//*[@id="article_special"]//h1/text()').extract() 
         item["Article"] = thing.xpath('//*[@id="article-body"]/p[1]/text()').extract() 
         item["Date"] = thing.xpath('//*[@id="spnDate"]/text()').extract() 
         item["Link"] = response.url 

         # I found that even with being careful about my XPaths I 
         # still got empty fields and lines, the below fixes that 

         if item['Headline']: 
          if item["Article"]: 
           if item["Date"]: 
            last_crawled = (time.strftime("%Y-%m-%d %H:%M")) 
            yield item 

      # if the headline item doesn't match, check the article item. 

      elif key.search(art): 
       if last_crawled > response.xpath('//*[@id="spnDate"]/text()').extract():
        for thing in response.xpath('//*[@id="aspnetForm"]'): 
         item = BasicItem() 
         item['Headline'] = thing.xpath('//*[@id="article_special"]//h1/text()').extract() 
         item["Article"] = thing.xpath('//*[@id="article-body"]/p[1]/text()').extract() 
         item["Date"] = thing.xpath('//*[@id="spnDate"]/text()').extract() 
         item["Link"] = response.url 

         if item['Headline']: 
          if item["Article"]: 
           if item["Date"]: 
            last_crawled = (time.strftime("%Y-%m-%d %H:%M")) 
            yield item 
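The two steps listed in the question (record when the last crawl happened, then compare each article's date against it) could be sketched as below. This is only an illustration under stated assumptions, not the asker's code: the helper names `load_last_crawled`, `save_last_crawled` and `is_new` are hypothetical, the timestamp is kept in a plain text file, and the article date is assumed to be in the same `%Y-%m-%d %H:%M` format as the `strftime` call above (the real format of the site's `spnDate` element is not given here). Note that `.extract()` returns a list, so the first element would need to be taken before parsing.

```python
import os
import tempfile
from datetime import datetime

# Assumed timestamp format, matching the strftime call in the question.
STAMP_FORMAT = "%Y-%m-%d %H:%M"


def load_last_crawled(path):
    """Return the stored last-crawl time, or None if no crawl was recorded."""
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return datetime.strptime(handle.read().strip(), STAMP_FORMAT)


def save_last_crawled(path, when):
    """Persist the crawl time so the next run can compare against it."""
    with open(path, "w") as handle:
        handle.write(when.strftime(STAMP_FORMAT))


def is_new(article_date_text, last_crawled, article_format=STAMP_FORMAT):
    """Return True if the article is newer than the last crawl.

    With no previous crawl recorded, every article counts as new.
    """
    if last_crawled is None:
        return True
    # Compare datetime objects, not raw strings or lists.
    article_date = datetime.strptime(article_date_text.strip(), article_format)
    return article_date > last_crawled


if __name__ == "__main__":
    # Hypothetical state file in a temporary directory, for demonstration only.
    state_file = os.path.join(tempfile.mkdtemp(), "last_crawled.txt")
    save_last_crawled(state_file, datetime(2015, 4, 1, 12, 0))
    last = load_last_crawled(state_file)
    print(is_new("2015-04-01 13:30", last))
    print(is_new("2015-03-31 09:00", last))
```

Loading the timestamp once when the spider starts and saving it once when the crawl finishes keeps the cutoff fixed for the whole run, instead of moving it every time an item is yielded.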






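This is a spider middleware that ignores requests to pages containing items seen in previous crawls of the same spider, thus producing a "delta crawl" containing only new items.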


pip install scrapylib 


    'scrapylib.deltafetch.DeltaFetch': 100, 
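The fragment above looks like an entry in Scrapy's `SPIDER_MIDDLEWARES` setting. A minimal `settings.py` sketch follows, assuming the middleware is switched on with the `DELTAFETCH_ENABLED` setting as in scrapylib's DeltaFetch documentation; check the names against the installed version.

```python
# settings.py (sketch; verify the setting names against your scrapylib version)

# Register the DeltaFetch spider middleware with the priority from the answer.
SPIDER_MIDDLEWARES = {
    'scrapylib.deltafetch.DeltaFetch': 100,
}

# Assumption: DeltaFetch stays inactive unless this setting is true.
DELTAFETCH_ENABLED = True
```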


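Thanks Lawrence. I came across a reference to deltafetch while trying to find an answer to this, but it didn't seem to meet my needs: I want the page to be visited and the links on it extracted and followed (in case there are new related articles or similar). Will deltafetch still extract the links to follow from a page even when I don't want item data extracted from it? I'll have a play with it regardless and let you know. Thanks for the reply! – 2015-04-02 06:21:39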


As far as I can tell, it works perfectly. Thanks! – 2015-04-02 13:46:24