
Scrapy: error 10054 after retrying image download

I'm running a Scrapy spider in Python to scrape images from a website. One of the images fails to download (even when I try to fetch it manually through the site) because of an internal error on the site. That's fine with me; I don't care about getting that image, I just want to skip it when it fails and move on to the other images, but I keep getting the 10054 error below.

> Traceback (most recent call last):
>   File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
>     current.result = callback(current.result, *args, **kw)
>   File "C:\Python27\Scripts\nhtsa\nhtsa\spiders\NHTSA_spider.py", line 137, in parse_photo_page
>     self.retrievePhoto(base_url_photo + url[0], url_text)
>   File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 49, in wrapped_f
>     return Retrying(*dargs, **dkw).call(f, *args, **kw)
>   File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 212, in call
>     raise attempt.get()
>   File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 247, in get
>     six.reraise(self.value[0], self.value[1], self.value[2])
>   File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 200, in call
>     attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
>   File "C:\Python27\Scripts\nhtsa\nhtsa\spiders\NHTSA_spider.py", line 216, in retrievePhoto
>     code.write(f.read())
>   File "c:\python27\lib\socket.py", line 355, in read
>     data = self._sock.recv(rbufsize)
>   File "c:\python27\lib\httplib.py", line 612, in read
>     s = self.fp.read(amt)
>   File "c:\python27\lib\socket.py", line 384, in read
>     data = self._sock.recv(left)
> error: [Errno 10054] An existing connection was forcibly closed by the remote host

Here is my parse function, which looks at the photo page and finds the important URLs:

def parse_photo_page(self, response):
    for sel in response.xpath('//table[@id="tblData"]/tr'):
        url = sel.xpath('td/font/a/@href').extract()
        table_fields = sel.xpath('td/font/text()').extract()
        if url:
            base_url_photo = "http://www-nrd.nhtsa.dot.gov/"
            url_text = table_fields[3]
            url_text = string.replace(url_text, "&nbsp", "")
            url_text = string.replace(url_text, " ", "")
            self.retrievePhoto(base_url_photo + url[0], url_text)

And here is my download function with its retry decorator:

from retrying import retry

@retry(stop_max_attempt_number=5, wait_fixed=2000)
def retrievePhoto(self, url, filename):
    fullPath = self.saveLocation + "/" + filename
    urllib.urlretrieve(url, fullPath)

It retries the download five times, but then raises the 10054 error and does not continue on to the next image. (As the traceback shows, once `retrying` exhausts its attempts it re-raises the last exception via `attempt.get()`, and nothing in `parse_photo_page` catches it.) How can I get the spider to continue after the retries? Again, I don't care about downloading the problem image, I just want to skip it.


It's not recommended to mix synchronous network IO (like `urllib.urlretrieve`) with asynchronous IO (Scrapy/Twisted). In any case, after 5 retries, `self.retrievePhoto(base_url_photo + url[0], url_text)` can still raise an exception. If you want the loop in `parse_photo_page` to keep iterating, you need to catch it inside a `try: ... except: ...`. Scrapy has an [`ImagesPipeline`](http://doc.scrapy.org/en/latest/topics/media-pipeline.html#using-the-images-pipeline) for retrieving images asynchronously. –
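A minimal sketch of that suggestion (hypothetical, not from the original comment; it assumes a Scrapy version where spiders have `self.logger`) — the call inside the loop becomes:

try:
    self.retrievePhoto(base_url_photo + url[0], url_text)
except Exception as e:
    # The image failed even after all retries: log it and move on to the next row
    self.logger.warning("Skipping %s: %s" % (url[0], e))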

+0

Thanks for the comment, I'm trying to implement an ImagesPipeline now... can't quite get it working, and I can't make much sense of the documentation –

+0

@JohnK: Are you saying you want to contribute to the open source project by improving the documentation? –

Answer


You're right that you shouldn't use urllib inside Scrapy, because it blocks everything. Try reading resources related to "scrapy twisted" and "scrapy asynchronous". Anyway... I don't believe your main problem is "continuing after retries" but rather not using relative XPaths in your expressions. Here is a version that works for me (note the './' in './td/font/a/@href'):

import scrapy
import string
import urllib
import os

class MyspiderSpider(scrapy.Spider):
    name = "myspider"
    start_urls = (
        'file:index.html',
    )

    saveLocation = os.getcwd()

    def parse(self, response):
        for sel in response.xpath('//table[@id="tblData"]/tr'):
            # The leading './' keeps the XPath relative to the current <tr>
            url = sel.xpath('./td/font/a/@href').extract()
            table_fields = sel.xpath('./td/font/text()').extract()
            if url:
                base_url_photo = "http://www-nrd.nhtsa.dot.gov/"
                url_text = table_fields[3]
                url_text = string.replace(url_text, "&nbsp", "")
                url_text = string.replace(url_text, " ", "")
                self.retrievePhoto(base_url_photo + url[0], url_text)

    from retrying import retry

    @retry(stop_max_attempt_number=5, wait_fixed=2000)
    def retrievePhoto(self, url, filename):
        fullPath = self.saveLocation + "/" + filename
        urllib.urlretrieve(url, fullPath)
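(This spider is self-contained, so it should be runnable without a full project via `scrapy runspider`, assuming Scrapy and the `retrying` package are installed.)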

And here's a (better) version that follows your pattern but uses the ImagesPipeline that @paul trmbrth mentioned.

import scrapy
import string
import os

class MyspiderSpider(scrapy.Spider):
    name = "myspider2"
    start_urls = (
        'file:index.html',
    )

    saveLocation = os.getcwd()

    # Enable the built-in ImagesPipeline and tell it where to store files
    custom_settings = {
        "ITEM_PIPELINES": {'scrapy.pipelines.images.ImagesPipeline': 1},
        "IMAGES_STORE": saveLocation
    }

    def parse(self, response):
        image_urls = []
        image_texts = []
        for sel in response.xpath('//table[@id="tblData"]/tr'):
            url = sel.xpath('./td/font/a/@href').extract()
            table_fields = sel.xpath('./td/font/text()').extract()
            if url:
                base_url_photo = "http://www-nrd.nhtsa.dot.gov/"
                url_text = table_fields[3]
                url_text = string.replace(url_text, "&nbsp", "")
                url_text = string.replace(url_text, " ", "")
                image_urls.append(base_url_photo + url[0])
                image_texts.append(url_text)

        # The ImagesPipeline downloads everything listed in the item's image_urls field
        return {"image_urls": image_urls, "image_texts": image_texts}

The demo file I used looks like this:

$ cat index.html
<table id="tblData"><tr>
<td><font>hi <a href="img/2015/cav.jpg"> foo </a> <span /> <span /> green.jpg </font></td>
</tr><tr>
<td><font>hi <a href="img/2015/caw.jpg"> foo </a> <span /> <span /> blue.jpg </font></td>
</tr></table>

Thank you so much @neverlastn!! I agree the images pipeline is the way to go. I tried to implement a pipeline yesterday and couldn't get it working. That little custom_settings snippet did it for me; I think my settings.py file wasn't being referenced correctly. Thanks again for the complete answer. –


You're welcome! :) I think `settings.py` is the proper way to do it. `custom_settings` is a bit of a hack - not very clean! I only used it there to have a simple, self-contained answer. – neverlastn
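For reference, the `settings.py` equivalent of the `custom_settings` dict above would be something like this (the store path is a placeholder):

# settings.py
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/path/to/images'  # placeholder: point this at your save location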


"Implementing a pipeline" - it can get quite tricky. Don't forget: whenever you want to find out how to do anything, always google "whatever twisted". Scrapy is a Twisted application, and unless you use Twisted-compatible techniques (blocking libraries like `urllib` are the problem), your performance will suffer. Here are a few examples: https://github.com/scalingexcellence/scrapybook/tree/master/ch09/properties/properties/pipelines – neverlastn