Scrapy image download
My spider runs without showing any errors, but the images are not saved to the folder. Below are my Scrapy files:
Spider.py:
import scrapy
import re
import os
import urlparse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from scrapy.pipelines.images import ImagesPipeline
from production.items import ProductionItem, ListResidentialItem

class productionSpider(scrapy.Spider):
    name = "production"
    allowed_domains = ["someurl.com"]
    start_urls = [
        "someurl.com"
    ]

    def parse(self, response):
        for sel in response.xpath('//html/body'):
            item = ProductionItem()
            img_url = sel.xpath('//a[@data-tealium-id="detail_nav_showphotos"]/@href').extract()[0]
            yield scrapy.Request(urlparse.urljoin(response.url, img_url), callback=self.parseBasicListingInfo, meta={'item': item})

    def parseBasicListingInfo(item, response):
        item = response.request.meta['item']
        item = ListResidentialItem()
        try:
            image_urls = map(unicode.strip, response.xpath('//a[@itemprop="contentUrl"]/@data-href').extract())
            item['image_urls'] = [x for x in image_urls]
        except IndexError:
            item['image_urls'] = ''
        return item
settings.py:
from scrapy.settings.default_settings import ITEM_PIPELINES
from scrapy.pipelines.images import ImagesPipeline
BOT_NAME = 'production'
SPIDER_MODULES = ['production.spiders']
NEWSPIDER_MODULE = 'production.spiders'
DEFAULT_ITEM_CLASS = 'production.items'
ROBOTSTXT_OBEY = True
DEPTH_PRIORITY = 1
IMAGE_STORE = '/images'
CONCURRENT_REQUESTS = 250
DOWNLOAD_DELAY = 2
ITEM_PIPELINES = {
    'scrapy.contrib.pipeline.images.ImagesPipeline': 300,
}
items.py
# -*- coding: utf-8 -*-
import scrapy

class ProductionItem(scrapy.Item):
    img_url = scrapy.Field()

# ScrapingList Residential & Yield Estate for sale
class ListResidentialItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    pass
My pipelines file is empty, and I am not sure what I need to add to the pipeline.py file.
Any help is greatly appreciated.
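For what it's worth, the built-in ImagesPipeline does not need any code in the pipelines file; it is driven entirely by settings. A minimal sketch of the relevant settings follows, assuming Scrapy 1.0 or later (which the spider's `scrapy.pipelines.images` import suggests) and that Pillow is installed. The setting the pipeline reads is `IMAGES_STORE` (plural); when it is missing, the pipeline is disabled without raising an error, which would match the symptom described. The storage location below is a hypothetical choice, not something taken from the question.

```python
# settings.py (sketch only; the storage path is an assumption)
import os

BOT_NAME = 'production'
SPIDER_MODULES = ['production.spiders']
NEWSPIDER_MODULE = 'production.spiders'

# ImagesPipeline reads IMAGES_STORE, not IMAGE_STORE.
# '/images' points at the filesystem root, which may not be writable,
# so this sketch uses an "images" folder under the working directory.
IMAGES_STORE = os.path.join(os.getcwd(), 'images')

# Module path used since Scrapy 1.0; 'scrapy.contrib.pipeline.images'
# is the older, pre-1.0 location.
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 300,
}
```

With these in place, any item that carries an `image_urls` list and an `images` field, like ListResidentialItem, should be picked up by the pipeline.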
Thanks Rafael, but there are still no images in the images folder. I added the pipeline to the settings.py file, changed the store path, and changed the following lines from image_urls = map(unicode.strip, response.xpath('//a[@itemprop="contentUrl"]/@data-href').extract()) item['image_urls'] = [x for x in image_urls] to item['image_urls'] = map(unicode.strip, response.xpath('//a[@itemprop="contentUrl"]/@data-href').extract()) – user1443063
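You can't use map for the images. If you want to save multiple images in one item, you have to build an array (a list) rather than a map; this will not work –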
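To illustrate the point about lists: ImagesPipeline expects `image_urls` to be a list of URL strings. In Python 2, `map` already returns a list, while in Python 3 it returns a lazy iterator, so an explicit list is the safer form either way. The sketch below is plain Python 3 with no Scrapy dependency; `to_image_urls` is a hypothetical helper name, and the URLs are made up for the example.

```python
from urllib.parse import urljoin  # Python 3; Python 2 code would use urlparse.urljoin


def to_image_urls(base_url, hrefs):
    """Return a list of absolute, whitespace-stripped image URLs.

    `hrefs` may be a single string or an iterable of strings.
    Empty or whitespace-only entries are dropped.
    """
    if isinstance(hrefs, str):
        hrefs = [hrefs]  # wrap a lone URL so the result is still a list
    return [urljoin(base_url, h.strip()) for h in hrefs if h and h.strip()]


# Hypothetical values standing in for response.url and the extracted hrefs.
urls = to_image_urls("http://someurl.com/listing/", [" /img/a.jpg ", "b.jpg", ""])
single = to_image_urls("http://someurl.com/listing/", "/img/c.jpg")
```

In the spider this would be used roughly as item['image_urls'] = to_image_urls(response.url, response.xpath(...).extract()), so that even a single match ends up as a one-element list.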
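I am very new to all of this. I tried to fix it by changing the line to item['image_urls'] = response.xpath('//a[@itemprop="contentUrl"]/@data-href').extract()[0]. The [0] should give only one image, but it still does not show up. Am I still missing something, or does it still need to be an array? – user1443063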