
How to execute a specific spider from a pipeline without activating it again

Introduction

The website I am scraping has two URLs:

  • /top lists the top players
  • /player/{name} shows the info of the player named {name}

From the first URL I get each player's name and position, and I can then call the second URL using that name. My current goal is to store all of this data in a database.

Problem

I created two spiders. The first crawls /top, and the second crawls /player/{name} for every player found by the first. However, to be able to insert the first spider's data into the database, I need to call the profile spider first, because the player id is a foreign key, as the following queries show:

INSERT INTO top_players (player_id, position) values (1, 1)

INSERT INTO players (name) values ('John Doe')
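Because player_id in top_players references players, the players row has to exist before the top_players row can point at it. A minimal sketch of that insert order, assuming sqlite3 and hypothetical table definitions matching the queries above:

import sqlite3

conn = sqlite3.connect('players.db')  # hypothetical database file
cur = conn.cursor()

# the player row must exist first so its generated id can serve as the foreign key
cur.execute("INSERT INTO players (name) VALUES (?)", ('John Doe',))
player_id = cur.lastrowid  # id generated for the new players row

# now the top_players row can safely reference that id
cur.execute("INSERT INTO top_players (player_id, position) VALUES (?, ?)", (player_id, 1))

conn.commit()
conn.close()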

Question

Is it possible to execute a spider from a Pipeline in order to get that spider's results? What I mean is that the called spider should not trigger the pipeline again.

Answer

I suggest you take more control of the scraping process, and in particular grab the name and position from the first page and the detail from the detail page. Try this:

# -*- coding: utf-8 -*-
import scrapy


class MyItem(scrapy.Item):
    name = scrapy.Field()
    position = scrapy.Field()
    detail = scrapy.Field()


class MySpider(scrapy.Spider):
    name = '<name of spider>'
    allowed_domains = ['mywebsite.org']
    start_urls = ['http://mywebsite.org/<path to the page>']

    def parse(self, response):
        rows = response.xpath('//a[contains(@href, "<div id or class>")]')

        # loop over all links to player pages
        for row in rows:
            myItem = MyItem()  # create a new item
            myItem['name'] = row.xpath('./text()').extract()  # assign name from the link text
            myItem['position'] = row.xpath('./text()').extract()  # assign position from the link text
            detail_url = response.urljoin(row.xpath('./@href').extract()[0])  # build the absolute detail URL
            request = scrapy.Request(url=detail_url, callback=self.parse_detail)  # request the detail page
            request.meta['myItem'] = myItem  # pass the partially filled item along with the request
            yield request

    def parse_detail(self, response):
        myItem = response.meta['myItem']  # recover the item (with name and position) from the response
        text_raw = response.xpath('//font[@size=3]//text()').extract()  # extract the detail text
        myItem['detail'] = ' '.join(s.strip() for s in text_raw)  # clean up the text and assign it to the item
        yield myItem  # return the completed item
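Since every item that reaches the pipeline now carries name, position, and detail together, a single item pipeline can perform both inserts in the right order. A minimal sketch, again assuming sqlite3; the database file and pipeline class names are hypothetical, and the first list element is taken because extract() returns lists:

import sqlite3

class PlayerDbPipeline(object):

    def open_spider(self, spider):
        self.conn = sqlite3.connect('players.db')  # hypothetical database file

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        cur = self.conn.cursor()
        name = item['name'][0].strip()  # extract() returned a list; take the first value
        position = item['position'][0].strip()
        # insert the player first, then reference its generated id in top_players
        cur.execute("INSERT INTO players (name) VALUES (?)", (name,))
        cur.execute("INSERT INTO top_players (player_id, position) VALUES (?, ?)",
                    (cur.lastrowid, position))
        return item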

'scrapy.Request' is the key; it solved the problem. Thanks :) – Doon
