Python, Scrapy, Pipeline: the "process_item" function is not being called
2015-07-10

I have a very simple piece of code, shown below. The scraping works and I can see all the print statements producing the correct data. In the pipeline, initialization runs fine. However, the process_item function is never called: the print statement at the start of that function never executes.

Spider: comosham.py

import scrapy 
from scrapy.spider import Spider 
from scrapy.selector import Selector 
from scrapy.http import Request 
from activityadvisor.items import ComoShamLocation 
from activityadvisor.items import ComoShamActivity 
from activityadvisor.items import ComoShamRates 
import re 


class ComoSham(Spider): 
    name = "comosham" 
    allowed_domains = ["www.comoshambhala.com"] 
    start_urls = [ 
        "http://www.comoshambhala.com/singapore/classes/schedules", 
        "http://www.comoshambhala.com/singapore/about/location-contact", 
        "http://www.comoshambhala.com/singapore/rates-and-offers/rates-classes", 
        "http://www.comoshambhala.com/singapore/rates-and-offers/rates-classes/rates-private-classes" 
    ] 

    def parse(self, response): 
        category = (response.url)[39:44] 
        print 'in parse' 
        if category == 'class': 
            pass 
            # self.gen_req_class(response) 
        elif category == 'about': 
            print 'about to call parse_location' 
            self.parse_location(response) 
        elif category == 'rates': 
            pass 
            # self.parse_rates(response) 
        else: 
            print 'Cant find appropriate category! check check check!! Am raising Level 5 ALARM - You are a MORON :D' 

    def parse_location(self, response): 
        print 'in parse_location' 
        item = ComoShamLocation() 
        item['category'] = 'location' 
        loc = Selector(response).xpath('((//div[@id = "node-2266"]/div/div/div)[1]/div/div/p//text())').extract() 
        item['address'] = loc[2] + loc[3] + loc[4] + (loc[5])[1:11] 
        item['pin'] = (loc[5])[11:18] 
        item['phone'] = (loc[9])[6:20] 
        item['fax'] = (loc[10])[6:20] 
        item['email'] = loc[12] 
        print item['address'], item['pin'], item['phone'], item['fax'], item['email'] 
        return item 

Items file:

import scrapy 
from scrapy.item import Item, Field 

class ComoShamLocation(Item): 
    address = Field() 
    pin = Field() 
    phone = Field() 
    fax = Field() 
    email = Field() 
    category = Field() 

Pipeline file:

import csv 

class ComoShamPipeline(object): 
    def __init__(self): 
        self.locationdump = csv.writer(open('./scraped data/ComoSham/ComoshamLocation.csv', 'wb')) 
        self.locationdump.writerow(['Address', 'Pin', 'Phone', 'Fax', 'Email']) 

    def process_item(self, item, spider): 
        print 'processing item now' 
        if item['category'] == 'location': 
            print item['address'], item['pin'], item['phone'], item['fax'], item['email'] 
            self.locationdump.writerow([item['address'], item['pin'], item['phone'], item['fax'], item['email']]) 
        else: 
            pass 

Do you yield an item at the end of the 'parse_location' function, and does it have its values? – GHajba


Yes, at the end of 'parse_location' I am printing it and the output is as expected. –


I assume you have, but I have to ask: did you configure the ItemPipeline in 'settings.py'? – GHajba

Answers

Your problem is that you never actually yield the item. parse_location returns an item back to parse, but parse never yields that item.

The fix is to replace:

self.parse_location(response) 

with:

yield self.parse_location(response) 

To be more specific: process_item is never called unless items are actually yielded.
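
As a minimal sketch (only the relevant branch is shown; the rest of parse stays exactly as in the question), the 'about' branch becomes:

def parse(self, response): 
    category = (response.url)[39:44] 
    # ... other branches unchanged ... 
    if category == 'about': 
        # yield hands the item to the engine, which then routes it 
        # through every pipeline configured in ITEM_PIPELINES 
        yield self.parse_location(response) 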


Use ITEM_PIPELINES in settings.py:

ITEM_PIPELINES = ['project_name.pipelines.pipeline_class'] 
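
Note that newer Scrapy versions expect a dict mapping the pipeline's class path to an ordering number rather than a list. Assuming the project is named activityadvisor (as the imports in the question suggest), that would look like:

# settings.py -- assuming the project is named "activityadvisor" 
ITEM_PIPELINES = { 
    'activityadvisor.pipelines.ComoShamPipeline': 300, 
} 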

Adding to the answers above:

1. Remember to add the following line to settings.py! ITEM_PIPELINES = {'[YOUR_PROJECT_NAME].pipelines.[YOUR_PIPELINE_CLASS]': 300}
2. Yield the item when your spider runs!


Corrected ['YOUR_PROJECT_NAME] to '[YOUR_PROJECT_NAME]'. –


This solved my problem: another pipeline was dropping all my items before my pipeline was called, so process_item() never ran, although open_spider and close_spider were called. My solution was simply to change the ordering so that this pipeline runs before the other pipeline that drops items.
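
A sketch of how that ordering is expressed in settings.py. DropItemsPipeline is a hypothetical name for the pipeline that discards items; Scrapy runs pipelines in ascending order of their numbers:

# settings.py -- DropItemsPipeline is hypothetical; lower numbers run first 
ITEM_PIPELINES = { 
    'activityadvisor.pipelines.ComoShamPipeline': 100,   # runs first, sees every item 
    'activityadvisor.pipelines.DropItemsPipeline': 200,  # may drop items afterwards 
} 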

See the Scrapy Pipeline documentation: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

Just remember: Scrapy only calls Pipeline.process_item() when there are items to process!