Trouble downloading images with Scrapy

I get the following error when I try to download images with a Scrapy spider.

File "C:\Python27\lib\site-packages\scrapy\http\request\__init__.py", line 61, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
exceptions.ValueError: Missing scheme in request url: h

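As best I can tell, it looks like I'm missing an "h" somewhere in a URL? But I can't for the life of me see where. Everything works fine as long as I don't try to download images. As soon as I add the relevant code to the four files below, it stops working. Can anyone help me make sense of this error?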

items.py

import scrapy 

class ProductItem(scrapy.Item): 
    model = scrapy.Field() 
    shortdesc = scrapy.Field() 
    desc = scrapy.Field() 
    series = scrapy.Field() 
    imageorig = scrapy.Field() 
    image_urls = scrapy.Field() 
    images = scrapy.Field()

settings.py

BOT_NAME = 'allenheath' 

SPIDER_MODULES = ['allenheath.spiders'] 
NEWSPIDER_MODULE = 'allenheath.spiders' 

ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1} 

IMAGES_STORE = 'c:/allenheath/images'

pipelines.py

class AllenheathPipeline(object):
    def process_item(self, item, spider):
        return item

import scrapy
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

products.py (my spider)

import scrapy 

from allenheath.items import ProductItem 
from scrapy.selector import Selector 
from scrapy.http import HtmlResponse 

class productsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["http://www.allen-heath.com/"]
    start_urls = [
        "http://www.allen-heath.com/ahproducts/ilive-80/",
        "http://www.allen-heath.com/ahproducts/ilive-112/"
    ]

    def parse(self, response):
        for sel in response.xpath('/html'):
            item = ProductItem()
            item['model'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['shortdesc'] = sel.css('#prodsingleouter > div > div > h3::text').extract()
            item['desc'] = sel.css('#tab1 #productcontent').extract()
            item['series'] = sel.css('#pagestrip > div > div > a:nth-child(3)::text').extract()
            item['imageorig'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['image_urls'] = sel.css('#tab1 #productcontent img').extract()[0]
            item['image_urls'] = 'http://www.allen-heath.com' + item['image_urls']
            yield item

Any help would be greatly appreciated.

Source

2015-04-28 jkupczak

The problem is here:

def get_media_requests(self, item, info):
    for image_url in item['image_urls']:
        yield scrapy.Request(image_url)

and here:

item['image_urls'] = sel.css('#tab1 #productcontent img').extract()[0]

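You extract the field and then take its first element, so item['image_urls'] ends up as a single string instead of a list. When the pipeline iterates over it, it is really iterating over the characters of that string, which begins with "http". That explains the error message you see, which is raised as soon as the first letter is processed: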

Missing scheme in request url: h
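
To make the mechanism concrete, here is a minimal sketch in plain Python with no Scrapy involved. The URL is a made-up placeholder standing in for whatever string ends up in the item; only the string-versus-list behaviour matters.

```python
# Hypothetical value standing in for the single string left in item['image_urls'].
image_urls = "http://www.allen-heath.com/example.jpg"

# Iterating over a string yields one-character strings, not URLs.
as_string = [url for url in image_urls]

# Iterating over a list that holds the string yields the whole URL once.
as_list = [url for url in [image_urls]]

print(as_string[0])  # h
print(as_list[0])    # http://www.allen-heath.com/example.jpg
```

The first loop is what the pipeline effectively does with the original code, so the first "URL" it tries to request is just "h".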

Remove the [0] from that line. While you're at it, extract the image's src attribute instead of the whole element:

item['image_urls'] = sel.css('#tab1 #productcontent img').xpath('./@src').extract()

After that, you should also update the next line so that, if the image URLs are relative, they are converted to absolute ones:

import urlparse # put this at the top of the script 
item['image_urls'] = [urlparse.urljoin(response.url, url) for url in item['image_urls']]

However, if the image URLs in src are already absolute, you don't need this last part, so you can simply remove it.
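
As a rough illustration of what urljoin does with each kind of src value, here is a small standalone sketch. It assumes Python 3, where the function lives in urllib.parse (the answer above uses the Python 2 urlparse module), and the src values and the cdn.example.com host are invented for the example.

```python
from urllib.parse import urljoin  # Python 3; urlparse.urljoin in Python 2

# Page URL taken from the spider's start_urls.
page_url = "http://www.allen-heath.com/ahproducts/ilive-80/"

# Hypothetical src values: root-relative, path-relative and already absolute.
srcs = [
    "/media/example.jpg",
    "images/example.jpg",
    "http://cdn.example.com/example.jpg",
]

# Resolve every src against the page URL; the result is still a list.
absolute = [urljoin(page_url, src) for src in srcs]

for url in absolute:
    print(url)
# http://www.allen-heath.com/media/example.jpg
# http://www.allen-heath.com/ahproducts/ilive-80/images/example.jpg
# http://cdn.example.com/example.jpg
```

An already absolute URL comes back unchanged, so leaving the urljoin line in place should be harmless even when it is not strictly needed.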

Source

2015-04-28 17:23:48 bosnjak

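Removing the '[0]' got rid of that error, but a new one popped up. The line right after it joins the domain string to the image URL. I did that in case the images use relative paths (although I don't think it's necessary). The error is: 'File "C:\allenheath\allenheath\spiders\products.py", line 24, in parse item['image_urls'] = 'http://www.allen-heath.com' + item['image_urls'] exceptions.TypeError: cannot concatenate 'str' and 'list' objects'. If I remove that line entirely, my script runs without any errors. However, I still don't get any images. – jkupczak

Check my update to the answer. – bosnjak

Thanks. That fixed the problem with that line. I can leave it in and it no longer errors out. But when I run it, I still don't get any images downloaded. I'm exporting the results to CSV and it writes an 'image_urls' column, so I know it sees the images I want to grab. It just doesn't download them. I must be missing something else in another part of my code. – jkupczak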
