Avoiding scraping data that has already been scraped
Good evening all,
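I'm still working on my spider that scrapes data from a news website, but I've run into another problem. My original question was posted here: Scrapy outputs [ into my .json file, but that has since been solved.
I've managed to get a bit further, making allowances for empty items and adding a search function. I'm now trying to scrape only the articles I haven't already scraped (bearing in mind that I still want to extract links from them). I can't work out where to put the code that:
a.) defines when the last crawl took place
b.) compares the article's date with the date of the last crawl.
I may just be struggling with the logic, so I'm turning to you.
My spider: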
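For what it's worth, steps a.) and b.) could be sketched roughly as below. This is only an illustration under assumptions: the helper names `parse_article_date` and `is_new_article` are made up for this example, and the date format is a guess, since the actual text inside `spnDate` on the site isn't shown here. The main point is that `.extract()` returns a list of strings, so it has to be joined and parsed into a `datetime` before it can be compared with anything.

```python
from datetime import datetime

# ASSUMPTION: the real format of the date text on the page is unknown;
# adjust this to match what the site actually renders.
ARTICLE_DATE_FORMAT = "%Y-%m-%d %H:%M"


def parse_article_date(date_texts, fmt=ARTICLE_DATE_FORMAT):
    """Turn the list returned by .extract() into a datetime, or None."""
    text = "".join(date_texts).strip()
    if not text:
        return None
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        # The text did not match the assumed format.
        return None


def is_new_article(date_texts, last_crawled, fmt=ARTICLE_DATE_FORMAT):
    """True when the article was published after the previous crawl."""
    published = parse_article_date(date_texts, fmt)
    if published is None:
        # No usable date: treated as "not new" here, which is a policy choice.
        return False
    # last_crawled is None on the very first run, so everything counts as new.
    return last_crawled is None or published > last_crawled
```

Inside a callback this would be called with the extracted date list and a `last_crawled` value that was fixed before the crawl started.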
# tabbing in python is apparently VERY important so be aware and make sure
# things that should line up do so
# import the CrawlSpider Class, along with its Rules, (this lets us recursively
# crawl pages)
from scrapy.contrib.spiders import CrawlSpider, Rule
# import the link extractor, this extracts links from pages
from scrapy.contrib.linkextractors import LinkExtractor
# import our items as defined in items.py
from basic.items import BasicItem
# import time so that we can get the current date and time
import time
# import re which allows us to search strings for keywords
import re
# create a new Spider with the CrawlSpider Class
class BasicSpiderSpider(CrawlSpider):
    # Name of the spider, this is used to run it, (i.e. scrapy crawl basic_spider)
    name = "basic_spider"
    # domains that the spider is allowed to crawl over
    allowed_domains = ["news24.com"]
    # where to start crawling from
    start_urls = [
        'http://www.news24.com',
    ]
    # Rules for the link extractor, (i.e. where it's allowed to look for links,
    # what to do once it's found them, and whether it's allowed to follow them)
    rules = (Rule(LinkExtractor(), callback="parse_items", follow=True),
    )
    # defining the callback function
    def parse_items(self, response):
        # defines the top level XPath where all of our information can be found, needs to be
        # as specific as possible to avoid duplicates
        for title in response.xpath('//*[@id="aspnetForm"]'):
            # List of keywords to search through.
            key = re.compile("joburg|durban", re.IGNORECASE)
            # extracting the data to compare with the keywords, this is for the
            # headlines, the join converts it from a list type to a string type
            headlist = title.xpath('//*[@id="article_special"]//h1/text()').extract()
            head = ''.join(headlist)
            # and this is for the article.
            artlist = title.xpath('//*[@id="article-body"]//text()').extract()
            art = ''.join(artlist)
            # if any keywords are found in the headline:
            if key.search(head):
                if last_crawled > response.xpath('//*[@id="spnDate"]/text()').extract():
                    # define the top level xpath again as python won't look outside
                    # its current function
                    for thing in response.xpath('//*[@id="aspnetForm"]'):
                        # fills the items defined in items.py with relevant data
                        item = BasicItem()
                        item['Headline'] = thing.xpath('//*[@id="article_special"]//h1/text()').extract()
                        item["Article"] = thing.xpath('//*[@id="article-body"]/p[1]/text()').extract()
                        item["Date"] = thing.xpath('//*[@id="spnDate"]/text()').extract()
                        item["Link"] = response.url
                        # I found that even with being careful about my XPaths I
                        # still got empty fields and lines, the below fixes that
                        if item['Headline']:
                            if item["Article"]:
                                if item["Date"]:
                                    last_crawled = (time.strftime("%Y-%m-%d %H:%M"))
                                    yield item
            # if the headline item doesn't match, check the article item.
            elif key.search(art):
                if last_crawled > response.xpath('//*[@id="spnDate"]/text()').extract():
                    for thing in response.xpath('//*[@id="aspnetForm"]'):
                        item = BasicItem()
                        item['Headline'] = thing.xpath('//*[@id="article_special"]//h1/text()').extract()
                        item["Article"] = thing.xpath('//*[@id="article-body"]/p[1]/text()').extract()
                        item["Date"] = thing.xpath('//*[@id="spnDate"]/text()').extract()
                        item["Link"] = response.url
                        if item['Headline']:
                            if item["Article"]:
                                if item["Date"]:
                                    last_crawled = (time.strftime("%Y-%m-%d %H:%M"))
                                    yield item
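It doesn't work, but as I said, I'm sceptical of my logic anyway. Could someone let me know whether I'm on the right track here?
Thanks again for all the help.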
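Thanks Lawrence. deltafetch was the reference I came across while looking for an answer, but it didn't seem to meet my needs: I still want to visit the page and extract links from it to follow (in case there are new related articles or other similar ones). Will deltafetch still extract the links to follow from a page even when it doesn't want to extract item data from it? I'll have a play with it regardless and let you know. Thanks for the reply! – 2015-04-02 06:21:39
As far as I can tell, it works perfectly, thank you! – 2015-04-02 13:46:24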