0
我一直在试图制作我的第一个抓取工具,并且已经创建了我所需要的(获得1º商店和2º商店的货运信息和价格),但使用2个抓取工具而不是1个,这里有一个大瓶子。Scrapy检测Xpath是否存在
当there'are超过1个店输出的结果是:
In [1]: response.xpath('//li[@class="container list-display-box__list__container"]/div/div/div/div/div[@class="shipping"]/p//text()').extract()
Out[1]:
[u'ENV\xcdO 3,95\u20ac ',
u'ENV\xcdO GRATIS',
u'ENV\xcdO GRATIS',
u'ENV\xcdO 4,95\u20ac ']
若要仅获取我使用的第二个结果:
In [2]: response.xpath('//li[@class="container list-display-box__list__container"]/div/div/div/div/div[@class="shipping"]/p//text()')[1].extract()
Out[2]: u'ENV\xcdO GRATIS'
但是,当没有第二个结果(只1商店)我得到:
IndexError: list index out of range
而爬行器跳过整个页面,即使其他项目有dat一个...
经过几次尝试后,我决定做一个快速解决方案来获得结果,第一个商店的2个履带1和第二个的履带1,但现在我想干净的只做1履带。
一些帮助,提示或建议将不胜感激,这是我第一次尝试使用scrapy制作递归爬虫,有点像它。
有代码:履带用时是1个多店的正确的格式
# -*- coding: utf-8 -*-
import scrapy
from Guapalia.items import GuapaliaItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class GuapaliaSpider(CrawlSpider):
name = "guapalia"
allowed_domains = ["guapalia.com"]
start_urls = (
'https://www.guapalia.com/perfumes?page=1',
'https://www.guapalia.com/maquillaje?page=1',
'https://www.guapalia.com/cosmetica?page=1',
'https://www.guapalia.com/linea-de-bano?page=1',
'https://www.guapalia.com/parafarmacia?page=1',
'https://www.guapalia.com/solares?page=1',
'https://www.guapalia.com/regalos?page=1',
)
rules = (
Rule(LinkExtractor(restrict_xpaths="//div[@class='js-pager']/a[contains(text(),'Siguientes')]"),follow=True),
Rule(LinkExtractor(restrict_xpaths="//div[@class='list-display__item list-display__item--product']/div/a[@class='col-xs-10 col-sm-10 col-md-12 clickOnProduct']"),callback='parse_articles',follow=True),
)
def parse_articles(self, response):
item = GuapaliaItem()
articles_urls = response.url
articles_first_shop = response.xpath('//div[@class="container-fluid list-display-box--best-deal"]/div/div/div/div[@class="retailer-logo autoimage-container"]/img/@title').extract()
articles_first_shipping = response.xpath('//div[@class="container-fluid list-display-box--best-deal"]/div/div/div/div[@class="shipping"]/p//text()').extract()
articles_second_shop = response.xpath('//li[@class="container list-display-box__list__container"]/div/div/div/div/div/img/@title')[1].extract()
articles_second_shipping = response.xpath('//li[@class="container list-display-box__list__container"]/div/div/div/div/div[@class="shipping"]/p//text()')[1].extract()
articles_name = response.xpath('//div[@id="ProductDetail"]/@data-description').extract()
item['articles_urls'] = articles_urls
item['articles_first_shop'] = articles_first_shop
item['articles_first_shipping'] = articles_first_shipping
item['articles_second_shop'] = articles_second_shop if articles_second_shop else 'N/A'
item['articles_second_shipping'] = articles_second_shipping
item['articles_name'] = articles_name
yield item
基本输出:
2017-09-21 09:53:11 [scrapy] DEBUG: Crawled (200) <GET https://www.guapalia.com/zen-edp-vaporizador-100-ml-75355> (referer: https://www.guapalia.com/perfumes?page=1)
2017-09-21 09:53:11 [scrapy] DEBUG: Scraped from <200 https://www.guapalia.com/zen-edp-vaporizador-100-ml-75355>
{'articles_first_shipping': [u'ENV\xcdO GRATIS'],
'articles_first_shop': [u'DOUGLAS'],
'articles_name': [u'ZEN edp vaporizador 100 ml'],
'articles_second_shipping': u'ENV\xcdO 3,99\u20ac ',
'articles_second_shop': u'BUYSVIP',
'articles_urls': 'https://www.guapalia.com/zen-edp-vaporizador-100-ml-75355'}
问题是,当不存在第二店因为我的代码在现场第二店
IndexError:列表索引超出范围
解决方案由于@Tarun Lalwani
def parse_articles(self, response):
item = GuapaliaItem()
articles_urls = response.url
articles_first_shop = response.xpath('//div[@class="container-fluid list-display-box--best-deal"]/div/div/div/div[@class="retailer-logo autoimage-container"]/img/@title').extract()
articles_first_shipping = response.xpath('//div[@class="container-fluid list-display-box--best-deal"]/div/div/div/div[@class="shipping"]/p//text()').extract()
articles_second_shop = response.xpath('//li[@class="container list-display-box__list__container"]/div/div/div/div/div/img/@title')
articles_second_shipping = response.xpath('//li[@class="container list-display-box__list__container"]/div/div/div/div/div[@class="shipping"]/p//text()')
articles_name = response.xpath('//div[@id="ProductDetail"]/@data-description').extract()
if len(articles_second_shop) > 1:
item['articles_second_shop'] = articles_second_shop[1].extract()
else:
item['articles_second_shop'] = 'Not Found'
if len(articles_second_shipping) > 1:
item['articles_second_shipping'] = articles_second_shipping[1].extract()
else:
item['articles_second_shipping'] = 'Not Found'
item['articles_urls'] = articles_urls
item['articles_first_shop'] = articles_first_shop
item['articles_first_shipping'] = articles_first_shipping
item['articles_name'] = articles_name
yield item
非常感谢你,它完美的工作! (非常逻辑的回答,咖啡时间)。 –