
Scrapy image download: my spider runs without showing any errors, but no images are saved to the folder. Below are my Scrapy files.

Spider.py:

import scrapy 
import re 
import os 
import urlparse 
from scrapy.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor 
from scrapy.loader.processors import Join, MapCompose, TakeFirst 
from scrapy.pipelines.images import ImagesPipeline 
from production.items import ProductionItem, ListResidentialItem 

class productionSpider(scrapy.Spider): 
    name = "production" 
    allowed_domains = ["someurl.com"] 
    start_urls = [
        "someurl.com"
    ]

    def parse(self, response):
        for sel in response.xpath('//html/body'):
            item = ProductionItem()
            img_url = sel.xpath('//a[@data-tealium-id="detail_nav_showphotos"]/@href').extract()[0]
            yield scrapy.Request(urlparse.urljoin(response.url, img_url), callback=self.parseBasicListingInfo, meta={'item': item})

    def parseBasicListingInfo(item, response):
        item = response.request.meta['item']
        item = ListResidentialItem()
        try:
            image_urls = map(unicode.strip, response.xpath('//a[@itemprop="contentUrl"]/@data-href').extract())
            item['image_urls'] = [x for x in image_urls]
        except IndexError:
            item['image_urls'] = ''

        return item

settings.py:

from scrapy.settings.default_settings import ITEM_PIPELINES 
from scrapy.pipelines.images import ImagesPipeline 

BOT_NAME = 'production' 

SPIDER_MODULES = ['production.spiders'] 
NEWSPIDER_MODULE = 'production.spiders' 
DEFAULT_ITEM_CLASS = 'production.items' 

ROBOTSTXT_OBEY = True 
DEPTH_PRIORITY = 1 
IMAGE_STORE = '/images' 

CONCURRENT_REQUESTS = 250 

DOWNLOAD_DELAY = 2 

ITEM_PIPELINES = { 
    'scrapy.contrib.pipeline.images.ImagesPipeline': 300, 
} 

items.py

# -*- coding: utf-8 -*- 
import scrapy 

class ProductionItem(scrapy.Item): 
    img_url = scrapy.Field() 

# ScrapingList Residential & Yield Estate for sale 
class ListResidentialItem(scrapy.Item): 
    image_urls = scrapy.Field() 
    images = scrapy.Field() 

    pass 

My pipelines file is empty and I'm not sure what I need to add to pipelines.py.

Any help is much appreciated.

Answers


Since you don't know what to put in your pipelines file, I assume you can use the default images pipeline that Scrapy provides, so in your settings.py file you can declare it like this:

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1
}

In the same file, your images path is also wrong: "/" points to the absolute root of your machine, so either set the absolute path to wherever you want to save the files, or use a relative path from where you run the crawler:

IMAGES_STORE = '/home/user/Documents/scrapy_project/images' 

IMAGES_STORE = 'images' 

Now, in your spider you extract that URL, but you don't save it into the item; you need something like:

item['image_urls'] = sel.xpath('//a[@data-tealium-id="detail_nav_showphotos"]/@href').extract_first() 

The field has to be literally named image_urls if you are using the default pipeline.
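For reference, the default pipeline expects image_urls to hold a list of absolute URLs; a minimal sketch of the question's callback under that assumption (the XPath and item class come from the question, the wrapping list and the urljoin call are mine):

def parseBasicListingInfo(self, response):
    item = ListResidentialItem()
    # image_urls must be a list of absolute URLs for the default ImagesPipeline
    item['image_urls'] = [
        response.urljoin(href)
        for href in response.xpath('//a[@itemprop="contentUrl"]/@data-href').extract()
    ]
    return item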

Now, in the items.py file you need to add the following two fields (both need to have these literal names):

image_urls=Field() 
images=Field() 

That should work.


Thanks Rafael, but still no images are populating the images folder. I added the pipeline to the settings.py file, changed the store path, and changed the lines image_urls = map(unicode.strip, response.xpath('//a[@itemprop="contentUrl"]/@data-href').extract()) / item['image_urls'] = [x for x in image_urls] to item['image_urls'] = map(unicode.strip, response.xpath('//a[@itemprop="contentUrl"]/@data-href').extract()) – user1443063


You can't map the images; if you want to save multiple images in one item you have to build a list (an array), not a map, so that won't work –
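A sketch of what this comment seems to be suggesting: build a plain list of stripped URL strings rather than assigning a map object (Python 2 style, to match the question's use of unicode):

hrefs = response.xpath('//a[@itemprop="contentUrl"]/@data-href').extract()
# a concrete list of cleaned URL strings, not a map object
item['image_urls'] = [href.strip() for href in hrefs]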


I'm very new to all of this, and I tried to fix it by changing the line to item['image_urls'] = response.xpath('//a[@itemprop="contentUrl"]/@data-href').extract()[0]. The [0] only gives one image, but it still doesn't show up. Am I still missing something, or does it still need to be a list? – user1443063


My final working result:

spider.py

import scrapy 
import re 
import urlparse 
from scrapy.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor 
from scrapy.loader.processors import Join, MapCompose, TakeFirst 
from scrapy.pipelines.images import ImagesPipeline 
from production.items import ProductionItem 
from production.items import ImageItem 

class productionSpider(scrapy.Spider): 
    name = "production" 
    allowed_domains = ["url"] 
    start_urls = [
        "startingurl.com"
    ]

    def parse(self, response):
        for sel in response.xpath('//html/body'):
            item = ProductionItem()
            img_url = sel.xpath('//a[@idd="followclaslink"]/@href').extract()[0]
            yield scrapy.Request(urlparse.urljoin(response.url, img_url), callback=self.parseImages, meta={'item': item})

    def parseImages(self, response):
        for elem in response.xpath("//img"):
            img_url = elem.xpath("@src").extract_first()
            yield ImageItem(image_urls=[img_url])

Settings.py

BOT_NAME = 'production' 

SPIDER_MODULES = ['production.spiders'] 
NEWSPIDER_MODULE = 'production.spiders' 
DEFAULT_ITEM_CLASS = 'production.items' 
ROBOTSTXT_OBEY = True 
IMAGES_STORE = '/Users/home/images' 

DOWNLOAD_DELAY = 2 

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1} 
# Disable cookies (enabled by default) 
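With these settings the stock ImagesPipeline handles the downloads: each URL in image_urls is fetched and stored under IMAGES_STORE in a full/ subdirectory, with a file name derived from the SHA1 hash of the image URL (the pipeline's default behaviour).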

items.py

# -*- coding: utf-8 -*- 
import scrapy 

class ProductionItem(scrapy.Item): 
    img_url = scrapy.Field() 
# ScrapingList Residential & Yield Estate for sale 
class ListResidentialItem(scrapy.Item): 
    image_urls = scrapy.Field() 
    images = scrapy.Field() 

class ImageItem(scrapy.Item): 
    image_urls = scrapy.Field() 
    images = scrapy.Field() 

pipelines.py

import scrapy 
from scrapy.pipelines.images import ImagesPipeline 
from scrapy.exceptions import DropItem 

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
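Note that with ITEM_PIPELINES pointing at scrapy.pipelines.images.ImagesPipeline, the stock pipeline is the one actually doing the work and MyImagesPipeline above is never invoked. To route items through the custom class instead, the setting would have to reference it, and the item would need an image_paths field for item_completed() to fill. A sketch under those assumptions (the production.pipelines module path is assumed to match the project layout above):

# settings.py: enable the custom pipeline instead of the stock one
ITEM_PIPELINES = {'production.pipelines.MyImagesPipeline': 1}

# items.py: add the field that item_completed() writes into
class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_paths = scrapy.Field()  # filled with the stored file paths by MyImagesPipeline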