Hello, I want to scrape data from http://economictimes.indiatimes.com/archive.cms. All of the URLs are archived by date, month, and year. To build the URL list first, I adapted the code from https://github.com/FraPochetti/StocksProject/blob/master/financeCrawler/financeCrawler/spiders/urlGenerator.py to my site, so that it recursively extracts the URLs from the site's archive with Scrapy:
import scrapy
import urllib

def etUrl():
    totalWeeks = []
    totalPosts = []
    url = 'http://economictimes.indiatimes.com/archive.cms'
    data = urllib.urlopen(url).read()
    hxs = scrapy.Selector(text=data)
    months = hxs.xpath('//ul/li/a').re('http://economictimes.indiatimes.com/archive.cms/\\d+-\\d+/news.cms')
    admittMonths = 12 * (2013 - 2007) + 8
    months = months[:admittMonths]
    for month in months:
        data = urllib.urlopen(month).read()
        hxs = scrapy.Selector(text=data)
        weeks = hxs.xpath('//ul[@class="weeks"]/li/a').re('http://economictimes.indiatimes.com/archive.cms/\\d+-\\d+/news/day\\d+\\.cms')
        totalWeeks += weeks
    for week in totalWeeks:
        data = urllib.urlopen(week).read()
        hxs = scrapy.Selector(text=data)
        posts = hxs.xpath('//ul[@class="archive"]/li/h1/a/@href').extract()
        totalPosts += posts
    with open("eturls.txt", "a") as myfile:
        for post in totalPosts:
            myfile.write(post + '\n')

etUrl()
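One step worth checking in isolation is the link extraction itself. The sketch below (my own helper names and sample markup, not from the question) applies the same month-URL pattern with the stdlib `re` module on a small HTML snippet, so the regex can be verified without any network access or Scrapy at all:

```python
import re

# Hypothetical sample of the archive page markup (for illustration only).
SAMPLE_HTML = """
<ul>
  <li><a href="http://economictimes.indiatimes.com/archive.cms/2013-8/news.cms">Aug 2013</a></li>
  <li><a href="http://economictimes.indiatimes.com/archive.cms/2013-7/news.cms">Jul 2013</a></li>
  <li><a href="http://economictimes.indiatimes.com/about.cms">About</a></li>
</ul>
"""

# The same pattern the question passes to .re(), applied with plain re.findall.
MONTH_RE = re.compile(
    r'http://economictimes\.indiatimes\.com/archive\.cms/\d+-\d+/news\.cms')

def extract_month_urls(html):
    """Return all month-archive URLs found in the given HTML string."""
    return MONTH_RE.findall(html)

print(extract_month_urls(SAMPLE_HTML))
```

If this prints an empty list against a saved copy of the real page, the XPath/regex is the problem rather than the fetching code.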
I saved the file as urlGenerator.py and ran it with the command $ python urlGenerator.py, but I get no output. Could someone help me adapt this code to my use case, or suggest any other working solution?
Is the call to 'etUrl()' present, conventionally protected by an 'if __name__ == "__main__": etUrl()'-type construct? –
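For reference, the guard that comment refers to looks like the following (a sketch with a stub body standing in for the question's crawl logic):

```python
def etUrl():
    # ... crawl logic as in the question ...
    return "done"

# Runs only when the file is executed directly (python urlGenerator.py),
# not when the module is imported from elsewhere.
if __name__ == "__main__":
    etUrl()
```

Without the guard, a bare top-level etUrl() call also fires on import, which is usually unwanted.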
It is also **very weird** to install Scrapy but then use 'urllib'-based request/response; arguably 50% of Scrapy's power is in how it manages the whole process, including having explicit callbacks that let you avoid the four-deep indentation you have there. –
I took the liberty of tidying up your post, since I assume you didn't intend to call etUrl() recursively at the bottom... – Iguananaut