2017-03-05

How can I tell Scrapy to split all the scraped items into two lists? For example, say I have two main item types - article and author - and I want them in two separate lists. Right now I get this JSON output:

[ 
    { 
    "article_title":"foo", 
    "article_published":"1.1.1972", 
    "author": "John Doe" 
    }, 
    { 
    "name": "John Doe", 
    "age": 42, 
    "email": "[email protected]" 
    } 
] 

How can I turn it into something like this?

{ 
    "articles": [ 
    { 
     "article_title": "foo", 
     "article_published": "1.1.1972", 
     "author": "John Doe" 
    } 
    ], 
    "authors": [ 
    { 
     "name": "John Doe", 
     "age": 42, 
     "email": "[email protected]" 
    } 
    ] 
} 

The functions that produce these items are all simple, similar to this:

def parse_author(self, response): 
     name = response.css('div.author-info a::text').extract_first() 
     print("Parsing author: {}".format(name)) 

     yield { 
      'author_name': name 
     } 

Answers


The items will arrive at the pipeline one by one, so you can handle each type accordingly with this setup:

items.py

import scrapy

class Article(scrapy.Item): 
    title = scrapy.Field() 
    published = scrapy.Field() 
    author = scrapy.Field() 

class Author(scrapy.Item): 
    name = scrapy.Field() 
    age = scrapy.Field() 

spider.py

def parse(self, response): 

    author = items.Author() 
    author['name'] = response.css('div.author-info a::text').extract_first() 
    print("Parsing author: {}".format(author['name'])) 
    yield author 

    article = items.Article() 
    article['title'] = response.css('article css').extract_first() 
    print("Parsing article: {}".format(article['title'])) 

    yield article 

pipelines.py

def process_item(self, item, spider): 
    if isinstance(item, items.Author): 
        pass  # Do something with authors 
    elif isinstance(item, items.Article): 
        pass  # Do something with articles 
    return item 
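To answer the follow-up question of grouping each type under a single JSON key, the pipeline can accumulate the items in lists and write one grouped file when the spider closes. Below is a minimal sketch; the `Article`/`Author` classes are plain-dict stand-ins for the `scrapy.Item` classes from items.py above (in a real project you would import them from your items module), and the pipeline class name and output filename are assumptions:

```python
import json

# Stand-ins for the scrapy.Item classes defined in items.py above;
# in a real project: from myproject.items import Article, Author
class Article(dict):
    pass

class Author(dict):
    pass

class GroupedJsonPipeline:
    """Accumulates items by type and writes a single grouped JSON
    file when the spider closes. Scrapy calls open_spider,
    process_item and close_spider automatically once the pipeline
    is enabled in ITEM_PIPELINES."""

    def open_spider(self, spider):
        self.data = {"articles": [], "authors": []}

    def process_item(self, item, spider):
        if isinstance(item, Article):
            self.data["articles"].append(dict(item))
        elif isinstance(item, Author):
            self.data["authors"].append(dict(item))
        return item  # pass the item on to any later pipelines

    def close_spider(self, spider):
        # Written once, after all items have been collected.
        with open("output.json", "w") as f:
            json.dump(self.data, f, indent=2)
```

This produces exactly the `{"articles": [...], "authors": [...]}` shape asked for, without relying on Scrapy's built-in JSON exporter.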

I would suggest this schema instead, though:

[{ 
    "title": "foo", 
    "published": "1.1.1972", 
    "authors": [ 
     { 
     "name": "John Doe", 
     "age": 42, 
     "email": "[email protected]" 
     }, 
     { 
     "name": "Jane Doe", 
     "age": 21, 
     "email": "[email protected]" 
     }, 
    ] 
}] 

This puts everything into a single item.

items.py

class Article(scrapy.Item): 
    title = scrapy.Field() 
    published = scrapy.Field() 
    authors = scrapy.Field() 

spider.py

def parse(self, response): 

    authors = [] 
    author = {} 
    author['name'] = "John Doe" 
    author['age'] = 42 
    author['email'] = "[email protected]" 
    print("Parsing author: {}".format(author['name'])) 
    authors.append(author) 

    article = items.Article() 
    article['title'] = "foo" 
    article['published'] = "1.1.1972" 
    print("Parsing article: {}".format(article['title'])) 
    article['authors'] = authors 
    yield article 
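Regarding the concern that the nested schema makes it harder to list all authors: they can still be recovered by flattening the exported items afterwards. A small sketch, assuming the export is a list of article dicts and that `email` can serve as a deduplication key (the helper name and sample email are placeholders):

```python
def all_authors(articles):
    """Collect unique authors from nested article items,
    keyed by email to avoid duplicates across articles."""
    seen = {}
    for article in articles:
        for author in article.get("authors", []):
            seen[author["email"]] = author
    return list(seen.values())

# Sample data in the suggested nested shape:
articles = [{
    "title": "foo",
    "published": "1.1.1972",
    "authors": [
        {"name": "John Doe", "age": 42, "email": "john@example.com"},
    ],
}]
```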

I'm still not sure how to group all items of a given type under a single JSON key from the pipeline. Modifying the pipeline to return '{'author': item}' still creates an 'author' key for every item. I think I need to accumulate all the items in a list of my own somewhere and output them as JSON at the end, but I don't know how to do that. ::: The schema you suggest works well if I mainly want to iterate over articles, but e.g. listing all the authors becomes harder. –


@MartinMelka我编辑了我的答案 –

raw = [ 
    { 
     "article_title":"foo", 
     "article_published":"1.1.1972", 
     "author": "John Doe" 
    }, 
    { 
     "name": "John Doe", 
     "age": 42, 
     "email": "[email protected]" 
    } 
] 

data = {'articles':[], "authors":[]} 

for a in raw: 

    if 'article_title' in a: 
        data['articles'].append(a) 

    else: 
        data['authors'].append(a) 
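To run this outside of Scrapy, the flat list can be read from the file that Scrapy's feed export produces and regrouped in a standalone script. A sketch; the filenames and the `scrapy crawl myspider -o raw.json` invocation are assumptions about your project setup:

```python
import json

def group_items(raw):
    """Split a flat list of scraped dicts into articles and authors.
    Articles are recognised by the presence of an 'article_title' key."""
    data = {"articles": [], "authors": []}
    for item in raw:
        key = "articles" if "article_title" in item else "authors"
        data[key].append(item)
    return data

# Typical usage with a file exported via `scrapy crawl myspider -o raw.json`:
# with open("raw.json") as f:
#     grouped = group_items(json.load(f))
# with open("grouped.json", "w") as f:
#     json.dump(grouped, f, indent=2)
```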

I don't know how to work with a dict like that in Scrapy. Whatever is yielded from the parse function goes straight to Scrapy, and I can't process it at the end. Could you expand your answer? –


@MartinMelka Process it where, exactly? Sorry, I don't get your question... My understanding is that your data should be accessible via 'item['articles']' – Umair