Scrapy: error 10054 after retrying the image download

I'm running a Scrapy spider in Python to scrape images from a website. One of the images fails to download (even when I try to download it normally through the site) due to an internal error on the site. That's fine; I don't care about getting that particular image. I just want to skip it and move on to the other images when it fails, but I keep getting a 10054 error.
```
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Python27\Scripts\nhtsa\nhtsa\spiders\NHTSA_spider.py", line 137, in parse_photo_page
    self.retrievePhoto(base_url_photo + url[0], url_text)
  File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 212, in call
    raise attempt.get()
  File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "C:\Python27\Scripts\nhtsa\nhtsa\spiders\NHTSA_spider.py", line 216, in retrievePhoto
    code.write(f.read())
  File "c:\python27\lib\socket.py", line 355, in read
    data = self._sock.recv(rbufsize)
  File "c:\python27\lib\httplib.py", line 612, in read
    s = self.fp.read(amt)
  File "c:\python27\lib\socket.py", line 384, in read
    data = self._sock.recv(left)
error: [Errno 10054] An existing connection was forcibly closed by the remote host
```
Here is my parse function, which looks at the photo page and finds the important URLs:
```python
def parse_photo_page(self, response):
    for sel in response.xpath('//table[@id="tblData"]/tr'):
        url = sel.xpath('td/font/a/@href').extract()
        table_fields = sel.xpath('td/font/text()').extract()
        if url:
            base_url_photo = "http://www-nrd.nhtsa.dot.gov/"
            url_text = table_fields[3]
            url_text = string.replace(url_text, " ", "")
            url_text = string.replace(url_text, " ", "")
            self.retrievePhoto(base_url_photo + url[0], url_text)
```
And here is my download function with the retry decorator:
```python
import urllib

from retrying import retry

@retry(stop_max_attempt_number=5, wait_fixed=2000)
def retrievePhoto(self, url, filename):
    fullPath = self.saveLocation + "/" + filename
    urllib.urlretrieve(url, fullPath)
```
It retries the download 5 times, but then throws the 10054 error and does not continue on to the next image. How can I get the spider to continue after the retries are exhausted? Again, I don't care about downloading the problem image; I just want to skip it.
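One way to keep the loop going is to catch the exception that survives the retries at the call site. The sketch below is illustrative: `download()` stands in for `self.retrievePhoto()` and deliberately fails on one URL, the way the problem image does; Errno 10054 surfaces as a `socket.error` (an `IOError` subclass in Python 2).

```python
import socket

def download(url):
    # Simulated downloader: one URL always fails, like the broken image.
    if "bad" in url:
        raise socket.error(10054, "connection forcibly closed")
    return "saved:" + url

def fetch_all(urls):
    saved, skipped = [], []
    for url in urls:
        try:
            saved.append(download(url))
        except (socket.error, IOError):
            # Record the failure and move on instead of re-raising,
            # so one bad image does not abort the whole loop.
            skipped.append(url)
    return saved, skipped

saved, skipped = fetch_all(["a.jpg", "bad.jpg", "c.jpg"])
```

In the spider this would mean wrapping the `self.retrievePhoto(...)` call in `parse_photo_page` in the same kind of `try/except` block.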
It's not recommended to mix synchronous network I/O (like `urllib.urlretrieve`) with asynchronous I/O (Scrapy/Twisted). In any case, after 5 retries, `self.retrievePhoto(base_url_photo + url[0], url_text)` can still raise an exception. If you want the loop in `parse_photo_page` to continue iterating, you need to catch it inside a `try: ... except: ...`. Scrapy has an [`ImagesPipeline`](http://doc.scrapy.org/en/latest/topics/media-pipeline.html#using-the-images-pipeline) to retrieve images asynchronously. –
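For reference, enabling the built-in `ImagesPipeline` is mostly configuration; a failed download is logged and omitted from the results without stopping the crawl. A minimal sketch (the `IMAGES_STORE` path is illustrative, and the import path shown is for Scrapy 1.0+; earlier releases used `scrapy.contrib.pipeline.images.ImagesPipeline`):

```python
# settings.py -- enable Scrapy's built-in images pipeline
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/path/to/saveLocation'  # illustrative path
```

The spider then yields items that have an `image_urls` field (plus an `images` field for the results) instead of downloading synchronously in `retrievePhoto`.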
Thanks for your comment, I'm trying to implement an ImagesPipeline now... can't quite get it working from the documentation. –
@JohnK: are you saying you'd like to contribute to the open source project by improving the documentation? –