2017-10-20 81 views
2

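Removing non-JSON-object rows from a Python DataFrame column

I have a DataFrame in which one column contains both JSON objects and strings. I want to get rid of the rows that do not contain JSON objects.

Here is what my DataFrame looks like: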

import pandas as pd 

df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india",{"a":9,"b":10,"c":11}]}) 

print(df) 

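How should I remove the rows that contain only strings, so that after eliminating them I can apply the following to convert the JSON objects in this column into separate DataFrame columns: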

from pandas.io.json import json_normalize 
df = json_normalize(df['A']) 
print(df) 
+0

Once you've done that, what's in your df isn't JSON, it's a dictionary. But it certainly has me occupied trying to selectively keep those rows :) – roganjosh

+0

Yes, by JSON I mean only dict objects. Any idea how to remove all the rows that contain plain strings like "hello", "world", etc.? –

+0

See this question: https://stackoverflow.com/questions/46856988/np-isreal-behavior-different-in-pandas-dataframe-and-numpy-array – Wen

Answers

3

I think I'd prefer to use an isinstance check:

In [11]: df.loc[df.A.apply(lambda d: isinstance(d, dict))] 
Out[11]: 
          A 
2 {'a': 5, 'b': 6, 'c': 8} 
5 {'d': 9, 'e': 10, 'f': 11} 

If you want to include numbers as well, you can do:

In [12]: df.loc[df.A.apply(lambda d: isinstance(d, (dict, np.number)))] 
Out[12]: 
          A 
2 {'a': 5, 'b': 6, 'c': 8} 
5 {'d': 9, 'e': 10, 'f': 11} 

Adjust this to whichever types you want to include.
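One caveat when adjusting the type tuple: values stored in an object column usually remain plain Python objects, and a plain Python int or float is not an instance of np.number. The sketch below, on a made-up column, compares np.number with the standard-library numbers.Number, which matches both Python and NumPy scalars. The variable names are illustrative only.

```python
import numbers

import numpy as np
import pandas as pd

# Hypothetical mixed column: a string, a dict, a Python int and a Python float
df = pd.DataFrame({"A": ["hello", {"a": 1}, 3, 2.5]})

# np.number only matches NumPy scalar types, so the plain int/float are dropped
kept_np = df.loc[df["A"].apply(lambda v: isinstance(v, (dict, np.number)))]

# numbers.Number matches Python numbers as well as NumPy scalars
kept_num = df.loc[df["A"].apply(lambda v: isinstance(v, (dict, numbers.Number)))]

print(kept_np.index.tolist())   # [1]
print(kept_num.index.tolist())  # [1, 2, 3]
```

Note that bool is a subclass of int, so True and False would also pass the numbers.Number check.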


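As a last step, json_normalize takes a list of JSON objects. For whatever reason a Series is no good (and gives a KeyError), but you can turn it into a list and you're good to go: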

In [21]: df1 = df.loc[df.A.apply(lambda d: isinstance(d, (dict, np.number)))] 

In [22]: json_normalize(list(df1["A"])) 
Out[22]: 
    a b c d  e  f 
0 5.0 6.0 8.0 NaN NaN NaN 
1 NaN NaN NaN 9.0 10.0 11.0 
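Putting the two steps together on the DataFrame from the question, a minimal end-to-end sketch might look like this. It assumes pandas 1.0 or later, where json_normalize is available as pd.json_normalize. Re-attaching the original row labels at the end is an optional extra, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({"A": ["hello", "world", {"a": 5, "b": 6, "c": 8},
                         "usa", "india", {"a": 9, "b": 10, "c": 11}]})

# Boolean mask: True where the cell holds a dict
is_dict = df["A"].apply(lambda v: isinstance(v, dict))

# Keep only the dict cells, then hand json_normalize a plain list
dict_rows = df.loc[is_dict, "A"]
flat = pd.json_normalize(dict_rows.tolist())

# Optional: restore the original row labels (2 and 5)
flat.index = dict_rows.index
print(flat)
```

Passing a list rather than the Series sidesteps the KeyError mentioned above.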
+0

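I prefer this answer. Since the other discussion doesn't seem to be going anywhere, do you happen to know why `isreal` works, so you can point me in the right direction for reading? – roganjosh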

+0

Applying the "normalize" code after your code gives a KeyError. –

+1

@roganjosh I don't know, I think you'd need to look at the code - I don't think np.isreal is intended to be used like that (I wouldn't want to rely on it) –

1
df[df.applymap(np.isreal).sum(1).gt(0)] 
Out[794]: 
          A 
2 {'a': 5, 'b': 6, 'c': 8} 
5 {'d': 9, 'e': 10, 'f': 11} 
+1

Please explain what exactly this is doing –

+0

I'm also confused about how this works. The docs don't offer much, certainly not for strings. Is this a side effect? – roganjosh

+0

`df[df.applymap(np.isreal).values]` might be a bit more concise. – cmaher

0

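If you want an ugly solution that also works... here is a function I wrote that finds the rows containing only strings and returns the df minus those rows. (Since your df has only one column, you just get a one-column DataFrame containing all the dicts.) Then, from there, you need to use df = json_normalize(df['A'].values) rather than just df = json_normalize(df['A'])

For a single-column DataFrame...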

import pandas as pd
import numpy as np
# pandas >= 1.0; on older versions: from pandas.io.json import json_normalize
from pandas import json_normalize

def delete_strings(df):
    """Return df restricted to the rows whose first column holds a dict."""
    rows_to_keep = []
    for row in np.arange(df.shape[0]):
        # keep the row number if the cell contains a dict
        if isinstance(df.iloc[row, 0], dict):
            rows_to_keep.append(row)
    # return a DataFrame (not a Series) so that df['A'] still works afterwards
    return df.iloc[rows_to_keep, :]

df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india",
                         {"a":9,"b":10,"c":11}]})
df = delete_strings(df)
df = json_normalize(df['A'].values)
print(df)
#    a   b   c
# 0  5   6   8
# 1  9  10  11

For a multi-column DataFrame (this also works with a single-column one):

def delete_rows_of_strings(df):
    rows = df.shape[0]  # number of rows in df
    cols = df.shape[1]  # number of columns in df
    rows_to_keep = []   # list to track rows to keep
    for row in np.arange(rows):  # for every row in the dataframe
        # num_string counts the number of strings in the row
        num_string = 0
        for col in np.arange(cols):  # for each column in the row...
            # if the value is a string, add one to num_string
            if isinstance(df.iloc[row, col], str):
                num_string += 1
        # if num_string, the number of strings in the row,
        # isn't equal to the number of columns...
        if num_string != cols:  # ...add that row number to the rows to keep
            rows_to_keep.append(row)
    # return the df with rows containing at least one non-string
    return df.iloc[rows_to_keep, :]


df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india"], 
         'B' : ['hi',{"a":5,"b":6,"c":8},'sup','america','china']}) 
#       A       B 
#0      hello      hi 
#1      world {'a': 5, 'b': 6, 'c': 8} 
#2 {'a': 5, 'b': 6, 'c': 8}      sup 
print(delete_rows_of_strings(df)) 
#       A       B 
#1      world {'a': 5, 'b': 6, 'c': 8} 
#2 {'a': 5, 'b': 6, 'c': 8}      sup
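
The same row filter can probably be written without explicit loops. The sketch below builds a boolean mask of string cells and drops the rows where every cell is a string. The helper name drop_all_string_rows is made up for illustration. It uses DataFrame.apply with Series.map rather than applymap so that it does not depend on which pandas version is installed.

```python
import pandas as pd

def drop_all_string_rows(df):
    """Hypothetical helper: drop the rows in which every cell is a str."""
    # Element-wise mask: True where the cell is a string
    is_str = df.apply(lambda col: col.map(lambda v: isinstance(v, str)))
    # Keep the rows that have at least one non-string cell
    return df.loc[~is_str.all(axis=1)]

df = pd.DataFrame({
    "A": ["hello", "world", {"a": 5, "b": 6, "c": 8}, "usa", "india"],
    "B": ["hi", {"a": 5, "b": 6, "c": 8}, "sup", "america", "china"],
})
result = drop_all_string_rows(df)
print(result.index.tolist())  # [1, 2]
```

Unlike the np.isreal approach discussed in the comments, the mask here states the intended type test explicitly.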