2016-11-29 209 views
5

Convert JSON to a pandas DataFrame

I have a JSON file containing multiple objects, such as:

{"reviewerID": "bc19970fff3383b2fe947cf9a3a5d7b13b6e57ef2cd53abc52bb2dfedf5fb1cd", "asin": "a6ed402934e3c1138111dce09256538afb04c566edf37c16b9ba099d23afb764", "overall": 2.0, "helpful": {"nHelpful": 1, "outOf": 1}, "reviewText": "This remote, for whatever reason, was chosen by Time Warner to replace their previous silver remote, the Time Warner Synergy V RC-U62CP-1.12S. The actual function of this CLIKR-5 is OK, but the ergonomic design sets back remotes by 20 years. The buttons are all the same, there's no separation of the number buttons, the volume and channel buttons are the same shape as the other buttons on the remote, and it all adds up to a crappy user experience. Why would TWC accept this as a replacement? I'm skipping this and paying double for a refurbished Synergy V.", "summary": "Ergonomic nightmare", "unixReviewTime": 1397433600} 

{"reviewerID": "3689286c8658f54a2ff7aa68ce589c81f6cae4c4d9de76fa0f66d5c114f79837", "asin": "8939d791e9dd035aa58da024ace69b20d651cea4adf6159d984872b44f663301", "overall": 4.0, "helpful": {"nHelpful": 21, "outOf": 22}, "reviewText": "This is a great truck GPS. I've tried others and nothing seems to come close to the Rand McNally TND-700.Excellent screen size and resolution. The audio is loud enough to be heard over road noise and the purr of my Kenworth/Cat engine. I've used it for the last 8,000 miles or so and it has only glitched once. Just restarted it and it picked up on my route right where it should have.Clean up the minor issues and this unit rates a solid 5.Rand McNally 528881469 7-inch Intelliroute TND 700 Truck GPS", "summary": "Great Unit!", "unixReviewTime": 1280016000} 

I tried to convert it to a pandas DataFrame with the following code:

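import json
import pandas as pd

train_df = pd.DataFrame()
count = 0
for line in open('train.json'):
    count += 1
    if count == 20001:  # stop after the first 20,000 lines
        break
    try:
        obj = json.loads(line)
    except ValueError:
        # some objects contain '\' characters that break decoding; strip them and retry
        obj = json.loads(line.replace('\\', ''))
    # one row per object; the nested 'helpful' dict does not line up with index=[0]
    df1 = pd.DataFrame(obj, index=[0])
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    train_df = pd.concat([train_df, df1], ignore_index=True)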

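However, it gives me NaN for the nested values, i.e. the 'helpful' attribute. I want the output to have each of the two keys of the nested attribute as a separate column.

Edit:

P.S.: I am using try/except because some objects contain a '\' character, which gives me a JSON decode error.

Can anyone help? Is there another approach I could use?

Thanks.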

+0

Have you tried `pandas.read_json`? http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html – DeepSpace

+0

@DeepSpace Yes, I have. It gives me an error: ValueError: 'Trailing data' –

+0

Trailing data means your file contains extra data that is not part of a JSON object. Look at your file and make sure it is all valid JSON. – RichSmith
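
For reference, a 'Trailing data' error from `pandas.read_json` is typical when a file holds one JSON object per line rather than a single JSON document. The following is a minimal sketch, assuming the file really is in that line-delimited form and that the installed pandas supports the `lines=True` parameter. The two records and the temporary file are made up for illustration and stand in for train.json.

```python
import json
import os
import tempfile

import pandas as pd

# Two made-up records in the same shape as the data in the question.
records = [
    {"reviewerID": "r1", "overall": 2.0, "helpful": {"nHelpful": 1, "outOf": 1}},
    {"reviewerID": "r2", "overall": 4.0, "helpful": {"nHelpful": 21, "outOf": 22}},
]

# Write one JSON object per line to a temporary file (a stand-in for train.json).
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    for record in records:
        tmp.write(json.dumps(record) + "\n")
    path = tmp.name

# lines=True tells pandas to parse each line as a separate JSON object.
lines_df = pd.read_json(path, lines=True)
os.remove(path)

print(lines_df)
```

Note that this only gets past the parsing error: the nested 'helpful' object still arrives as a single column of dicts, so it has to be flattened separately if each key should become its own column.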

Answers

4

Use json_normalize on a list of dicts. This runs at a reasonable speed even for a large number of JSON objects.

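import json
import pandas as pd

my_list = []
with open('train.json') as f:
    for line in f:
        # strip the backslashes that break JSON decoding
        line = line.replace('\\', '')
        my_list.append(json.loads(line))

# pd.json_normalize needs pandas >= 1.0; older versions use
# `from pandas.io.json import json_normalize`
# avoid transposing if you want to keep keys as columns of the dataframe
result_df = pd.json_normalize(my_list).T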
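
To make the effect on the nested 'helpful' attribute concrete, here is a small sketch on two made-up records shaped like the ones in the question. It assumes pandas >= 1.0, where `json_normalize` is available as `pd.json_normalize`. Nested keys become separate columns named with a dot separator by default, and the `sep` argument changes that separator.

```python
import pandas as pd

# Made-up records in the same shape as the question's data.
sample = [
    {"reviewerID": "r1", "overall": 2.0, "helpful": {"nHelpful": 1, "outOf": 1}},
    {"reviewerID": "r2", "overall": 4.0, "helpful": {"nHelpful": 21, "outOf": 22}},
]

# Default: nested keys are flattened to 'helpful.nHelpful' and 'helpful.outOf'.
flat_df = pd.json_normalize(sample)
print(flat_df.columns.tolist())

# Optional: use an underscore so the column names are valid attribute names.
flat_underscore_df = pd.json_normalize(sample, sep="_")
print(flat_underscore_df.columns.tolist())
```

Without the transpose, each review is one row and each key, including the two nested ones, is one column, which is the layout the question asks for.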


0

Try:

pd.concat([pd.Series(json.loads(line)) for line in open('train.json')], axis=1) 


+0

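This seems to work. Is there a way to apply the above solution to only the first 100 objects and store them in a separate DataFrame? The file is very large, and I cannot run the above solution on the whole file. Also, is there a way to use try/except with it? Some objects contain a '\', which gives me a JSONDecodeError –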
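
One possible way to handle both requests is sketched below. It is not taken from the answers above: `load_first_n` is a hypothetical helper name, and the in-memory `demo` text stands in for the real file. It reads at most `n` lines with `itertools.islice`, retries a line with its backslashes removed when decoding fails (the same fallback used in the question), and flattens the result with `pd.json_normalize` (pandas >= 1.0 assumed).

```python
import io
import json
from itertools import islice

import pandas as pd


def load_first_n(lines, n=100):
    """Parse up to n lines (one JSON object per line) into a flat DataFrame."""
    rows = []
    for line in islice(lines, n):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            rows.append(json.loads(line))
        except ValueError:
            # json.JSONDecodeError is a ValueError; drop backslashes and retry.
            rows.append(json.loads(line.replace("\\", "")))
    return pd.json_normalize(rows)


# Made-up input: the second line contains a stray backslash, the third is never read.
demo = io.StringIO(
    '{"summary": "ok", "helpful": {"nHelpful": 1, "outOf": 1}}\n'
    + r'{"summary": "bad \ slash", "helpful": {"nHelpful": 2, "outOf": 3}}' + "\n"
    + '{"summary": "never read", "helpful": {"nHelpful": 0, "outOf": 0}}\n'
)
first_two = load_first_n(demo, n=2)
print(first_two)
```

With the real file, the call would look like `with open('train.json') as f: first_100 = load_first_n(f, 100)`. Because the file object is read lazily, only the first 100 lines are ever loaded into memory. Removing every backslash also removes legitimate escapes such as `\"`, so a line that still fails after the retry will raise.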