How to use Scrapy to collect data from multiple pages into a single data structure

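I am trying to scrape data from a website. The data is organized as multiple objects, each with its own set of fields: for example, people with a name, an age, and an occupation.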

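My problem is that this data is split across two levels of the site.
The first page is a list of names and ages, with a link to each person's profile page.
Each profile page lists that person's occupation.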

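I already have a spider written in Python that collects the data from the top level and crawls through the paginated listing.
But how can I collect the data from the inner pages while keeping it linked to the right object?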

Currently, my output is structured as JSON, like this:

{[name='name',age='age',occupation='occupation'], 
    [name='name',age='age',occupation='occupation']} etc
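
The notation above is shorthand rather than valid JSON. As a minimal sketch, assuming the intended shape is an array with one object per person, the same structure could be built and serialized as follows (the field values are placeholders, not scraped data):

```python
import json

# Hypothetical sample records; the values are placeholders, not scraped data.
people = [
    {"name": "name", "age": "age", "occupation": "occupation"},
    {"name": "name", "age": "age", "occupation": "occupation"},
]

# Serialize the list of per-person objects into a single JSON array.
output = json.dumps(people, indent=2)
print(output)
```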

Can a parse function reach across pages like that?

Source

2013-02-14 user2071236

Here is one way to handle it. You need to yield/return the item only once, when it has all of its attributes.

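from scrapy import Request

# Inside the spider: request the listing page first.
yield Request(page1,
              callback=self.page1_data)

def page1_data(self, response):
    # Fill in the fields that are available on the listing page,
    # e.g. with response.xpath(...). TestItem is your project's Item class.
    i = TestItem()
    i['name'] = 'name'
    i['age'] = 'age'
    url_profile_page = 'url to the profile page'

    # Pass the partly filled item to the profile-page callback via meta.
    yield Request(url_profile_page,
                  meta={'item': i},
                  callback=self.profile_page)


def profile_page(self, response):
    # Retrieve the item that page1_data attached to the request.
    old_item = response.meta['item']
    # parse other fields
    # assign them to old_item

    # The item is complete now, so yield it exactly once.
    yield old_item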

Source

2013-02-14 09:11:23
