2017-03-31 94 views

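Reading sub-level JSON data with Pandas

I'm stuck trying to read sub-level data with Pandas.

Background:

I downloaded a range of data with the NYT Archive API and saved it to a JSON file. The file actually contains a list of JSON objects.

Steps:

I read the JSON file with the read_json method.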

import pandas as pd 
pandas_df = pd.read_json("data.json")

When I look at a sample of the result with head(), it looks like this:

pandas_df.head() 
    copyright \ 
0 Copyright (c) 2013 The New York Times Company.... 
1 Copyright (c) 2013 The New York Times Company.... 
2 Copyright (c) 2013 The New York Times Company.... 
3 Copyright (c) 2013 The New York Times Company.... 
4 Copyright (c) 2013 The New York Times Company.... 

              response 
0 {'docs': [{'subsection_name': None, 'slideshow... 
1 {'docs': [{'subsection_name': None, 'slideshow... 
2 {'docs': [{'subsection_name': None, 'slideshow... 
3 {'docs': [{'subsection_name': None, 'slideshow... 
4 {'docs': [{'subsection_name': None, 'slideshow... 

I only need the information in response, so I changed the code as follows:

print(pandas_df["response"].head()) 
0 {'docs': [{'subsection_name': None, 'slideshow... 
1 {'docs': [{'subsection_name': None, 'slideshow... 
2 {'docs': [{'subsection_name': None, 'slideshow... 
3 {'docs': [{'subsection_name': None, 'slideshow... 
4 {'docs': [{'subsection_name': None, 'slideshow... 
Name: response, dtype: object 

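Question:

How can I get at the data inside the docs element, such as subsection_name, slideshow_credits, and so on? Can I view it in tabular form, like a DataFrame?

Please let me know if more information is needed.

Thanks.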

EDIT 1:

Adding the first element from the JSON file. The whole file is too big to post, at around 1 GB.

{ 
    "copyright": "Copyright (c) 2013 The New York Times Company. All Rights Reserved.", 
    "response": { 
    "meta": { 
     "hits": 7652 
    }, 
    "docs": [ 
     { 
     "web_url": "http://www.nytimes.com/interactive/2016/technology/personaltech/cord-cutting-guide.html", 
     "snippet": "We teamed up with The Wirecutter to come up with cord-cutter bundles for movie buffs, sports addicts, fans of premium TV shows, binge watchers and families with children.", 
     "lead_paragraph": "We teamed up with The Wirecutter to come up with cord-cutter bundles for movie buffs, sports addicts, fans of premium TV shows, binge watchers and families with children.", 
     "abstract": null, 
     "print_page": null, 
     "blog": [], 
     "source": "The New York Times", 
     "multimedia": [ 
      { 
      "width": 190, 
      "url": "images/2016/10/13/business/13TECHFIX/06TECHFIX-thumbWide.jpg", 
      "height": 126, 
      "subtype": "wide", 
      "legacy": { 
       "wide": "images/2016/10/13/business/13TECHFIX/06TECHFIX-thumbWide.jpg", 
       "wideheight": "126", 
       "widewidth": "190" 
      }, 
      "type": "image" 
      }, 
      { 
      "width": 600, 
      "url": "images/2016/10/13/business/13TECHFIX/06TECHFIX-articleLarge.jpg", 
      "height": 346, 
      "subtype": "xlarge", 
      "legacy": { 
       "xlargewidth": "600", 
       "xlarge": "images/2016/10/13/business/13TECHFIX/06TECHFIX-articleLarge.jpg", 
       "xlargeheight": "346" 
      }, 
      "type": "image" 
      }, 
      { 
      "width": 75, 
      "url": "images/2016/10/13/business/13TECHFIX/06TECHFIX-thumbStandard.jpg", 
      "height": 75, 
      "subtype": "thumbnail", 
      "legacy": { 
       "thumbnailheight": "75", 
       "thumbnail": "images/2016/10/13/business/13TECHFIX/06TECHFIX-thumbStandard.jpg", 
       "thumbnailwidth": "75" 
      }, 
      "type": "image" 
      } 
     ], 
     "headline": { 
      "main": "The Definitive Guide to Cord-Cutting in 2016, Based on Your Habits", 
      "kicker": "Tech Fix" 
     }, 
     "keywords": [ 
      { 
      "rank": "1", 
      "is_major": "N", 
      "name": "subject", 
      "value": "Video Recordings, Downloads and Streaming" 
      }, 
      { 
      "rank": "2", 
      "is_major": "N", 
      "name": "subject", 
      "value": "Television Sets and Media Devices" 
      }, 
      { 
      "rank": "1", 
      "is_major": "Y", 
      "name": "subject", 
      "value": "Television" 
      } 
     ], 
     "pub_date": "2016-01-01T05:00:00Z", 
     "document_type": "multimedia", 
     "news_desk": "Technology/Personal Tech", 
     "section_name": "Technology", 
     "subsection_name": "Personal Tech", 
     "byline": { 
      "person": [ 
      { 
       "firstname": "Brian", 
       "middlename": "X.", 
       "lastname": "CHEN", 
       "rank": 1, 
       "role": "reported", 
       "organization": "" 
      } 
      ], 
      "original": "By BRIAN X. CHEN" 
     }, 
     "type_of_material": "Interactive Feature", 
     "_id": "57fdfb9895d0e022439c2b57", 
     "word_count": null, 
     "slideshow_credits": null 
     }]}} 

Could you post the first few lines of the raw JSON in full?

Added, please take a look.

I want to read "docs".

Answer

You should be able to extract all the elements of the docs list, which is nested under the response dictionary, into a DataFrame.

import json 
import pandas as pd 

# Load the whole file, then build a DataFrame from the nested docs list. 
with open('data.json') as f: 
    data = json.load(f) 
df = pd.DataFrame(data['response']['docs'])
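
With a plain pd.DataFrame call, nested fields such as headline and byline remain dictionaries inside a single cell. One possible way to spread them into their own columns is pd.json_normalize (available under that name from pandas 1.0; older releases expose it as pandas.io.json.json_normalize). The sketch below uses a trimmed-down stand-in for the posted JSON rather than the real 1 GB file, and the variable names sample, flat_df and keywords_df are only illustrative.

```python
import pandas as pd

# Trimmed-down stand-in for one object from the question (assumption: the
# real objects have the same shape, with more fields per doc).
sample = {
    "copyright": "Copyright (c) 2013 The New York Times Company. All Rights Reserved.",
    "response": {
        "meta": {"hits": 7652},
        "docs": [
            {
                "_id": "57fdfb9895d0e022439c2b57",
                "headline": {
                    "main": "The Definitive Guide to Cord-Cutting in 2016, Based on Your Habits",
                    "kicker": "Tech Fix",
                },
                "section_name": "Technology",
                "subsection_name": "Personal Tech",
                "slideshow_credits": None,
                "keywords": [
                    {"rank": "1", "is_major": "N", "name": "subject",
                     "value": "Video Recordings, Downloads and Streaming"},
                    {"rank": "1", "is_major": "Y", "name": "subject",
                     "value": "Television"},
                ],
            }
        ],
    },
}

docs = sample["response"]["docs"]

# Nested dicts become dotted column names such as "headline.main".
# Lists such as "keywords" are left as Python lists in a single cell.
flat_df = pd.json_normalize(docs)
print(flat_df.columns.tolist())

# To get one row per keyword instead, point record_path at the nested list
# and carry the document id along as metadata.
keywords_df = pd.json_normalize(docs, record_path="keywords", meta=["_id"])
print(keywords_df)
```

Here flat_df has one row per article, while keywords_df has one row per keyword with the owning article's _id repeated, so the two can be joined back together on _id if needed.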

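The last line gives me an error: TypeError: list indices must be integers or slices, not str. Do you know why? Is it because I'm reading a file that contains multiple JSON objects?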
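
That TypeError is what Python raises when a list is indexed with a string, which would be consistent with json.load returning a list of objects rather than a single dictionary. Assuming the file really is a top-level list where each item has the shape shown in EDIT 1, a minimal sketch is to loop over the items and gather every docs list before building the DataFrame. The helper name docs_to_frame and the tiny data list are made up for illustration.

```python
import pandas as pd


def docs_to_frame(data):
    """Build one DataFrame from a single archive object or a list of them."""
    # Accept both shapes so the same call works for one object or many.
    objects = data if isinstance(data, list) else [data]
    rows = []
    for obj in objects:
        # Each object carries its articles under response -> docs.
        rows.extend(obj["response"]["docs"])
    return pd.DataFrame(rows)


# Stand-in for what json.load(f) might return for a file holding a list
# of objects (assumption: real items have many more fields).
data = [
    {"response": {"docs": [{"_id": "a", "section_name": "Technology"}]}},
    {"response": {"docs": [{"_id": "b", "section_name": "World"},
                           {"_id": "c", "section_name": "Arts"}]}},
]

df = docs_to_frame(data)
print(df.shape)
```

Note that json.load still reads the whole file into memory, so with a file of around 1 GB this approach needs enough RAM to hold the parsed data.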

I fixed the JSON input by adding one closing bracket and two closing braces. Copy that exact JSON into a file and run my code again. It should work.