JSON objects inside a pandas DataFrame

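I have JSON objects in a column of a pandas dataframe that I want to split out into other columns. In the dataframe, each JSON object looks like a string containing an array of dictionaries. The array can be of variable length, including zero, or the cell can even be null. I've written some code, shown below, which does what I want. The column names are built from two components: the first is a key name from the dictionary, and the second is a substring of the dictionary's "key" value.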

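This code works okay, but it is very slow when run on a big dataframe. Can anyone offer a faster (and probably simpler) way to do this? Also, feel free to pick holes in what I've done if you see something that is not sensible/efficient/pythonic. I'm still a relative beginner. Thanks heaps.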

# Import libraries
import io
import json
import pandas as pd
from IPython.display import display  # Used to display df's nicely in jupyter notebook.

# Set some display options
pd.set_option('display.max_colwidth', 150)

# Create the example dataframe
print("Original df:")
df = pd.DataFrame.from_dict({
    'ColA': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567},
    'ColB': {0: '[{"key":"keyValue=1","valA":"8","valB":"18"},{"key":"keyValue=2","valA":"9","valB":"19"}]',
             1: '[{"key":"keyValue=2","valA":"28","valB":"38"},{"key":"keyValue=3","valA":"29","valB":"39"}]',
             2: '[{"key":"keyValue=4","valA":"48","valC":"58"}]',
             3: '[]',
             4: None}})
display(df)

# Collect the one-row dataframes here, record by record
frames = []

# Step through all rows in the dataframe
for i in range(df.shape[0]):
    cell = df.iloc[i, df.columns.get_loc("ColB")]
    # Check whether record is null, or doesn't contain any real data
    if pd.notnull(cell) and len(cell) > 2:
        # Convert the json structure into a dataframe, one cell at a time in the relevant column
        x = pd.read_json(io.StringIO(cell))
        # The last bit of this string (after the last =) will be used as a key for the column labels
        x['key'] = x['key'].apply(lambda k: k.split("=")[-1])
        # Set this new key to be the index
        y = x.set_index('key')
        # Stack the rows up via a multi-level column index
        y = y.stack().to_frame().T
        # Flatten out the multi-level column index
        y.columns = ['{1}_{0}'.format(*c) for c in y.columns]
        # Give the single record the same index number as the parent dataframe (for the merge to work)
        y.index = [df.index[i]]
        # Keep this one-row dataframe for each row as we go through the loop
        frames.append(y)

# Stack the per-row dataframes into one temporary dataframe
dfTemp = pd.concat(frames)

# Merge the new dataframe back onto the original one as extra columns, with index matching original dataframe
df = pd.merge(df, dfTemp, how='left', left_index=True, right_index=True)

print("Processed df:")
display(df)

Source

2017-08-15 Michael

Just a small thing. You could replace your loop with 'for i, col_b in enumerate(df.iloc[:, df.columns.get_loc("ColB")]):' and change the references to that entry accordingly, to improve readability. – Nyps
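A minimal sketch of the loop shape suggested in this comment. The small dataframe and the non_empty_rows list are made up for illustration; only the for line comes from the comment itself.

```python
import pandas as pd

# Small stand-in dataframe; the values are illustrative only.
df = pd.DataFrame({
    "ColA": [123, 234, 345],
    "ColB": ['[{"key":"keyValue=1","valA":"8"}]', "[]", None],
})

non_empty_rows = []
# enumerate yields the row position and the cell value together, so the
# loop body no longer repeats df.iloc[i, df.columns.get_loc("ColB")].
for i, col_b in enumerate(df.iloc[:, df.columns.get_loc("ColB")]):
    if pd.notnull(col_b) and len(col_b) > 2:
        non_empty_rows.append(i)

print(non_empty_rows)
```

Inside the loop, col_b takes the place of every lookup of the current ColB cell.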

Thanks! That would certainly make it more concise and readable. – Michael

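First, a general piece of advice about pandas: if you find yourself iterating over the rows of a dataframe, you are most likely doing it wrong.

With that in mind, we can rewrite your current procedure using the pandas "apply" method (this will probably speed it up to begin with, since it means far fewer index lookups on the df):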

import io

def do_the_thing(row):
    # Check whether record is null, or doesn't contain any real data
    if pd.notnull(row) and len(row) > 2:
        # Convert the json structure into a dataframe, one cell at a time in the relevant column
        x = pd.read_json(io.StringIO(row))
        # The last bit of this string (after the last =) will be used as a key for the column labels
        x['key'] = x['key'].apply(lambda k: k.split("=")[-1])
        # Set this new key to be the index
        y = x.set_index('key')
        # Stack the rows up via a multi-level column index
        y = y.stack().to_frame().T
        # Flatten out the multi-level column index
        y.columns = ['{1}_{0}'.format(*c) for c in y.columns]

        # we don't need to re-index
        # Give the single record the same index number as the parent dataframe (for the merge to work)
        # y.index = [df.index[i]]
        # we don't need to add to a temp df
        # Append this dataframe on sequentially for each row as we go through the loop
        return y.iloc[0]
    else:
        # Empty result for null or empty cells; an explicit dtype avoids a warning on some pandas versions
        return pd.Series(dtype=object)

df2 = df.merge(df.ColB.apply(do_the_thing), how='left', left_index=True, right_index=True)

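Note that this returns exactly the same result as before; we haven't changed the logic. The apply method takes care of the index, so we can merge directly, which is nice.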
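To illustrate the point about the index, here is a small sketch on toy data (the Series s and the helper split_pair are hypothetical, not part of the code above). When the function passed to Series.apply returns a Series, the result is a dataframe that keeps the input's index labels, which is why the index-on-index merge lines up.

```python
import pandas as pd

# Toy input with a non-default index so the alignment is visible.
s = pd.Series(["a=1", "b=2", "c=3"], index=[10, 20, 30])

def split_pair(cell):
    # Returning a Series makes apply build one output column per label.
    name, value = cell.split("=")
    return pd.Series({"name": name, "value": int(value)})

expanded = s.apply(split_pair)

# expanded carries the same index labels as s, so a merge on the index
# puts each expanded row next to the row it came from.
merged = s.to_frame("raw").merge(expanded, how="left",
                                 left_index=True, right_index=True)
print(merged)
```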

I believe that answers your question in terms of speeding things up and being more idiomatic.
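If it is still too slow, one option worth trying is to skip the per-row pd.read_json call and parse each cell with json.loads instead, building plain dicts and constructing the extra columns in one go. This is a different technique from the apply version above, offered as a sketch only: flatten_cell is a hypothetical helper, the dataframe is a cut-down copy of the example, and the values stay as the strings found in the JSON (pd.read_json may infer numeric types instead). No timings are claimed here.

```python
import json
import pandas as pd

# Cut-down copy of the example data from the question.
df = pd.DataFrame({
    "ColA": [123, 345, 456, 567],
    "ColB": [
        '[{"key":"keyValue=1","valA":"8","valB":"18"},'
        '{"key":"keyValue=2","valA":"9","valB":"19"}]',
        '[{"key":"keyValue=4","valA":"48","valC":"58"}]',
        "[]",
        None,
    ],
})

def flatten_cell(cell):
    """Turn one JSON string into a flat dict such as {"valA_1": "8", ...}."""
    if pd.isnull(cell):
        return {}
    flat = {}
    for entry in json.loads(cell):
        # The part after the last "=" becomes the column suffix.
        suffix = entry["key"].split("=")[-1]
        for name, value in entry.items():
            if name != "key":
                flat[f"{name}_{suffix}"] = value
    return flat

# One DataFrame constructor call builds all the new columns at once;
# missing keys become NaN.
extra = pd.DataFrame([flatten_cell(c) for c in df["ColB"]], index=df.index)
result = df.join(extra)
print(result)
```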

I think you should consider, however, what you want to do with this data structure, and how you might better structure what you are doing.

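Given that ColB can be of arbitrary length, you will end up with a dataframe with an arbitrary number of columns. When you come to access these values, for whatever purpose, this will cause you pain.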

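Are all the entries in ColB important? Could you get away with keeping only the first one? Do you need to know the index of a certain valA value?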

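These are the questions you should ask yourself; then decide on a structure that lets you do whatever analysis you need without having to check a bunch of arbitrary things.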
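As one concrete illustration of the "keep only the first one" option, here is a sketch under the assumption that the first dictionary in each array is the one that matters (first_entry is a hypothetical helper and the data is a shortened example). The resulting columns depend only on the dictionary keys, not on how long each array is.

```python
import json
import pandas as pd

# Shortened example data; values are illustrative.
df = pd.DataFrame({
    "ColA": [123, 456, 567],
    "ColB": [
        '[{"key":"keyValue=1","valA":"8"},{"key":"keyValue=2","valA":"9"}]',
        "[]",
        None,
    ],
})

def first_entry(cell):
    """Return the first dict in the JSON array, or an empty dict."""
    if pd.isnull(cell):
        return {}
    entries = json.loads(cell)
    return entries[0] if entries else {}

# Rows with no usable entry get NaN in the new columns.
first = pd.DataFrame([first_entry(c) for c in df["ColB"]], index=df.index)
fixed = df.join(first)
print(fixed)
```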

Source

2017-08-15 16:01:01

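Thanks very much for the comprehensive response, much appreciated! Your code is simpler, better, and easier to reuse. I implemented your suggestion and cut the execution time by 20%. Thanks also for the other advice. I agree that my overall approach isn't great. One possibility would be to create a new dataframe from the column, with a new column specifying the "key" value. So rather than adding a new set of columns for each key value, I would add a new set of rows. I'll try that next, if I can figure out how to do it. :-) – Michael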
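A rough sketch of the row-based layout described in this comment, assuming ColA is what links the new rows back to the original dataframe (long_df and the loop are illustrative, not code from the thread). Each dictionary in a ColB array becomes one row, with the "key" suffix stored in its own column.

```python
import json
import pandas as pd

# Cut-down copy of the example data from the question.
df = pd.DataFrame({
    "ColA": [123, 345, 456, 567],
    "ColB": [
        '[{"key":"keyValue=1","valA":"8","valB":"18"},'
        '{"key":"keyValue=2","valA":"9","valB":"19"}]',
        '[{"key":"keyValue=4","valA":"48","valC":"58"}]',
        "[]",
        None,
    ],
})

rows = []
for col_a, cell in zip(df["ColA"], df["ColB"]):
    if pd.isnull(cell):
        continue
    for entry in json.loads(cell):
        # One output row per dictionary, tagged with the parent's ColA.
        row = {"ColA": col_a, **entry}
        row["key"] = row["key"].split("=")[-1]
        rows.append(row)

long_df = pd.DataFrame(rows)
print(long_df)
```

Rows whose ColB is empty or null contribute nothing to long_df; a left merge on ColA would bring them back if they are needed.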
