2017-03-05

How can I tell Scrapy to split all the scraped items into two lists? For example, say I have two main item types - article and author - and I want them in two separate lists. Right now I get this JSON output:

[ 
    { 
    "article_title":"foo", 
    "article_published":"1.1.1972", 
    "author": "John Doe" 
    }, 
    { 
    "name": "John Doe", 
    "age": 42, 
    "email": "[email protected]" 
    } 
] 

How can I turn it into something like this?

{ 
    "articles": [ 
    { 
     "article_title": "foo", 
     "article_published": "1.1.1972", 
     "author": "John Doe" 
    } 
    ], 
    "authors": [ 
    { 
     "name": "John Doe", 
     "age": 42, 
     "email": "[email protected]" 
    } 
    ] 
} 

The functions that produce these items are all simple, similar to this:

def parse_author(self, response): 
     name = response.css('div.author-info a::text').extract_first() 
     print("Parsing author: {}".format(name)) 

     yield { 
      'author_name': name 
     } 

Answers


The items will arrive at the pipeline one by one, so you can handle each type accordingly with this setup:

items.py

import scrapy

class Article(scrapy.Item): 
    title = scrapy.Field() 
    published = scrapy.Field() 
    author = scrapy.Field() 

class Author(scrapy.Item): 
    name = scrapy.Field() 
    age = scrapy.Field() 

spider.py

def parse(self, response): 

    author = items.Author() 
    author['name'] = response.css('div.author-info a::text').extract_first() 
    print("Parsing author: {}".format(author['name'])) 
    yield author 

    article = items.Article() 
    article['title'] = response.css('article css').extract_first() 
    print("Parsing article: {}".format(article['title'])) 

    yield article 

pipelines.py

def process_item(self, item, spider): 
    if isinstance(item, items.Author): 
        pass  # Do something with authors 
    elif isinstance(item, items.Article): 
        pass  # Do something with articles 
    return item 
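To answer the follow-up question of grouping each type under a single JSON key, the pipeline can accumulate the items in lists and write one grouped file when the spider closes. Below is a minimal sketch; the `Article`/`Author` classes are plain-dict stand-ins for the `scrapy.Item` classes from items.py above (in a real project you would import them from your items module), and the pipeline class name and output filename are assumptions:

```python
import json

# Stand-ins for the scrapy.Item classes defined in items.py above;
# in a real project: from myproject.items import Article, Author
class Article(dict):
    pass

class Author(dict):
    pass

class GroupedJsonPipeline:
    """Accumulates items by type and writes a single grouped JSON
    file when the spider closes. Scrapy calls open_spider,
    process_item and close_spider automatically once the pipeline
    is enabled in ITEM_PIPELINES."""

    def open_spider(self, spider):
        self.data = {"articles": [], "authors": []}

    def process_item(self, item, spider):
        if isinstance(item, Article):
            self.data["articles"].append(dict(item))
        elif isinstance(item, Author):
            self.data["authors"].append(dict(item))
        return item  # pass the item on to any later pipelines

    def close_spider(self, spider):
        # Written once, after all items have been collected.
        with open("output.json", "w") as f:
            json.dump(self.data, f, indent=2)
```

This produces exactly the `{"articles": [...], "authors": [...]}` shape asked for, without relying on Scrapy's built-in JSON exporter.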

I would suggest this schema instead, though:

[{ 
    "title": "foo", 
    "published": "1.1.1972", 
    "authors": [ 
     { 
     "name": "John Doe", 
     "age": 42, 
     "email": "[email protected]" 
     }, 
     { 
     "name": "Jane Doe", 
     "age": 21, 
     "email": "[email protected]" 
     }, 
    ] 
}] 

This puts everything into a single item.

items.py

class Article(scrapy.Item): 
    title = scrapy.Field() 
    published = scrapy.Field() 
    authors = scrapy.Field() 

spider.py

def parse(self, response): 

    authors = [] 
    author = {} 
    author['name'] = "John Doe" 
    author['age'] = 42 
    author['email'] = "[email protected]" 
    print("Parsing author: {}".format(author['name'])) 
    authors.append(author) 

    article = items.Article() 
    article['title'] = "foo" 
    article['published'] = "1.1.1972" 
    print("Parsing article: {}".format(article['title'])) 
    article['authors'] = authors 
    yield article 
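Regarding the concern that the nested schema makes it harder to list all authors: they can still be recovered by flattening the exported items afterwards. A small sketch, assuming the export is a list of article dicts and that `email` can serve as a deduplication key (the helper name and sample email are placeholders):

```python
def all_authors(articles):
    """Collect unique authors from nested article items,
    keyed by email to avoid duplicates across articles."""
    seen = {}
    for article in articles:
        for author in article.get("authors", []):
            seen[author["email"]] = author
    return list(seen.values())

# Sample data in the suggested nested shape:
articles = [{
    "title": "foo",
    "published": "1.1.1972",
    "authors": [
        {"name": "John Doe", "age": 42, "email": "john@example.com"},
    ],
}]
```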

I'm still not sure how to group all items of a given type under a single JSON key from the pipeline. Modifying the pipeline to return '{'author': item}' still creates an 'author' key for every item. I think I need to accumulate all the items in a list of my own somewhere and output them as JSON at the end, but I don't know how to do that. ::: The schema you suggest works well if I mainly want to iterate over articles, but e.g. listing all the authors becomes harder. –


@MartinMelka我编辑了我的答案 –

raw = [ 
    { 
     "article_title":"foo", 
     "article_published":"1.1.1972", 
     "author": "John Doe" 
    }, 
    { 
     "name": "John Doe", 
     "age": 42, 
     "email": "[email protected]" 
    } 
] 

data = {'articles':[], "authors":[]} 

for a in raw: 

    if 'article_title' in a: 
        data['articles'].append(a) 

    else: 
        data['authors'].append(a) 
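To run this outside of Scrapy, the flat list can be read from the file that Scrapy's feed export produces and regrouped in a standalone script. A sketch; the filenames and the `scrapy crawl myspider -o raw.json` invocation are assumptions about your project setup:

```python
import json

def group_items(raw):
    """Split a flat list of scraped dicts into articles and authors.
    Articles are recognised by the presence of an 'article_title' key."""
    data = {"articles": [], "authors": []}
    for item in raw:
        key = "articles" if "article_title" in item else "authors"
        data[key].append(item)
    return data

# Typical usage with a file exported via `scrapy crawl myspider -o raw.json`:
# with open("raw.json") as f:
#     grouped = group_items(json.load(f))
# with open("grouped.json", "w") as f:
#     json.dump(grouped, f, indent=2)
```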

I don't know how to work with a dict like that in Scrapy. Whatever is yielded from the parse function goes straight to Scrapy, and I can't process it at the end. Could you expand your answer? –


@MartinMelka Process it where, exactly? Sorry, I don't get your question... My understanding is that your data should be accessible via 'item['articles']' – Umair