I want to design a web crawler that scrapes data from Flipkart, and I am using MongoDB to store the data, but Scrapy is throwing an error for the start URL. My code is as follows:

WebSpider.py

from scrapy.spider import CrawlSpider
from scrapy.selector import Selector
from spider_web.items import SpiderWebItem


class WebSpider(CrawlSpider):
    name = "spider_web"
    allowed_domains = ["http://www.flipkart.com"]
    start_urls = [
        "http://www.flipkart.com/search?q=amish+tripathi",
    ]

    def parse(self, response):
        books = response.selector.xpath(
            '//div[@class="old-grid"]/div[@class="gd-row browse-grid-row"]')

        for book in books:
            item = SpiderWebItem()

            item['title'] = book.xpath(
                './/div[@class="pu-details lastUnit"]/div[@class="pu-title fk-font-13"]/a[contains(@href, "from-search")]/@title').extract()[0].strip()

            item['rating'] = book.xpath(
                './/div[@class="pu-details lastUnit"]/div[@class="pu-rating"]/div[1]/@title').extract()[0]

            item['noOfRatings'] = book.xpath(
                './/div[@class="pu-details lastUnit"]/div[@class="pu-rating"]/text()').extract()[1].strip()

            item['url'] = response.url

            yield item

items.py

from scrapy.item import Item, Field 

class SpiderWebItem(Item): 
    url = Field() 
    title = Field() 
    rating = Field() 
    noOfRatings = Field() 

pipelines.py

import pymongo 

from scrapy.conf import settings 
from scrapy.exceptions import DropItem 
from scrapy import log 


class MongoDBPipeline(object):

    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        for data in item:
            if not data:
                raise DropItem("Missing data!")
        self.collection.update({'title': item['title']}, dict(item), upsert=True)
        log.msg("book added to MongoDB database!",
                level=log.DEBUG, spider=spider)
        return item

settings.py

BOT_NAME = 'spider_web'

SPIDER_MODULES = ['spider_web.spiders'] 
NEWSPIDER_MODULE = 'spider_web.spiders' 
DOWNLOAD_HANDLERS = { 
     's3': None, 
} 
DOWNLOAD_DELAY = 0.25 
DEPTH_PRIORITY = 1 
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue' 
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue' 
ITEM_PIPELINES = ['spider_web.pipelines.MongoDBPipeline', ] 

MONGODB_SERVER = "localhost" 
MONGODB_PORT = 27017 
MONGODB_DB = "flipkart" 
MONGODB_COLLECTION = "books" 

I have checked every XPath with the scrapy shell, and they all produce the correct results. But an error is thrown for the start URL. When I run the spider, the error is:

2015-10-05 20:05:10 [scrapy] ERROR: Spider error processing <GET http://www.flipkart.com/search?q=rabindranath+tagore> (referer: None)

........ 

    File "F:\myP\Web Scraping\spider_web\spider_web\spiders\WebSpider.py", line 21, in parse 
    './/div[@class="pu-details lastUnit"]/div[@class="pu-rating"]/div[1]/@title').extract()[0] 
IndexError: list index out of range 

I am at my wits' end here. The spider extracts data for one or two items, then raises this error and stops altogether. Any help would be greatly appreciated. Thank you in advance.

Answer

There are also books that have no rating, and you need to handle those as well. For example:

try: 
    item['rating'] = book.xpath('.//div[@class="pu-details lastUnit"]/div[@class="pu-rating"]/div[1]/@title').extract()[0] 
except IndexError: 
    item['rating'] = 'no rating' 
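
A more compact alternative, assuming Scrapy 1.0+ where SelectorList.extract_first() is available, is to pass a default value instead of indexing into the list:

# extract_first() returns the first matched value, or the given
# default when the XPath matches nothing, so no IndexError is raised
item['rating'] = book.xpath(
    './/div[@class="pu-details lastUnit"]/div[@class="pu-rating"]/div[1]/@title'
).extract_first(default='no rating')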

However, I would really recommend using Item Loaders with input and output processors here, and letting them handle these cases.
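
A minimal sketch of that approach, assuming Scrapy 1.0+ (scrapy.loader.ItemLoader and the processors in scrapy.loader.processors) on Python 2; the loader class name and the processor choices are illustrative, not part of the original code:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst

from spider_web.items import SpiderWebItem


class BookLoader(ItemLoader):
    # build SpiderWebItem instances by default
    default_item_class = SpiderWebItem
    # strip whitespace from every extracted string on input
    default_input_processor = MapCompose(lambda s: s.strip())
    # keep the first non-empty extracted value; when an XPath matches
    # nothing, the field is simply left unset instead of raising IndexError
    default_output_processor = TakeFirst()

and in parse():

    for book in books:
        loader = BookLoader(selector=book)
        loader.add_xpath(
            'title',
            './/div[@class="pu-details lastUnit"]/div[@class="pu-title fk-font-13"]/a/@title')
        loader.add_xpath(
            'rating',
            './/div[@class="pu-details lastUnit"]/div[@class="pu-rating"]/div[1]/@title')
        loader.add_xpath(
            'noOfRatings',
            './/div[@class="pu-details lastUnit"]/div[@class="pu-rating"]/text()')
        loader.add_value('url', response.url)
        yield loader.load_item()

Note that the strip-then-TakeFirst combination also replaces the fragile .extract()[1] indexing for noOfRatings: the whitespace-only text node is stripped to an empty string, which TakeFirst skips.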