Replace NULL and blank values in all DataFrame columns with the most frequent non-null item of the respective column

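I'm new to Python. I'm trying to replace the NULL and blank ('') values in the columns of a pandas DataFrame with the most frequent item in that column, and I need to do this for every column and every row of the DataFrame. I wrote the code below, but it takes a long time to execute. Can you help me optimize it?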

Thanks, Saptarshi

for column in df:
    # Get each value and its frequency from the column
    tempDict = df[column].value_counts().to_dict()

    # Pop the entries for 'NULL' and '?'
    tempDict.pop(b'NULL', None)
    tempDict.pop(b'?', None)

    # Identify the most frequent item of the remaining set
    # (max(tempDict) alone would return the largest key, not the most frequent one)
    maxItem = max(tempDict, key=tempDict.get)

    # The next step is to replace all rows where '?' or 'NULL' appears with maxItem
    # df_test[column] = df_test[column].str.replace(b'NULL', maxItem)
    # df_test[column] = df_test[column].str.replace(b'?', maxItem)
    # Use .loc rather than chained indexing so the assignment hits df itself
    df.loc[df[column] == b'NULL', column] = maxItem
    df.loc[df[column] == b'?', column] = maxItem

Source

2017-10-18 Saptarshi Chaudhuri

What behavior do you want when there is no "most frequent" item (i.e. all values are null, or several items are tied)? – ASGM

You can use mode() to find the most common value in each column:

for val in ['', 'NULL', '?']:
    # replace() returns a new DataFrame, so assign the result back
    df = df.replace(val, df.mode().iloc[0])

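Because there may be more than one modal value, mode() returns a DataFrame. Using .iloc[0] takes the first row of that DataFrame, i.e. the first modal value of each column. You can use fillna() instead of replace() if you also want to convert NaN values, as @Wen does.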
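To make the shape of the mode() result concrete, here is a small sketch of what happens when one column has a tie. The frame `demo` and its column names are made up for illustration and are not part of the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 'a' has a single mode, 'b' has a tie between 1 and 2
demo = pd.DataFrame({'a': ['x', 'x', 'y', np.nan], 'b': [1, 2, 1, 2]})

# One row per modal value; columns with fewer modes are padded with NaN
modes = demo.mode()
print(modes)

# The first row is a Series indexed by column name
first_modes = modes.iloc[0]

# fillna() with a Series fills each column with its matching entry
filled = demo.fillna(first_modes)
print(filled)
```

With a tie, .iloc[0] picks the smallest of the tied values, because mode() returns them sorted. If a different tie-breaking rule is needed, it has to be applied to `modes` before filling.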

Source

2017-10-18 17:16:43 ASGM

Here I create some sample data.

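import numpy as np
import pandas as pd
df = pd.DataFrame({'col1': [6,3,'null',4,4,2,'?'], 'col2': [6,3,2,'null','?',2,2]})
# Turn the placeholder strings into NaN so that mode() ignores them
df.replace({'?':np.nan},inplace=True)
df.replace({'null':np.nan},inplace=True)
# Fill the NaN values in each column with that column's most frequent value
df.fillna(df.apply(lambda x : x.mode()[0]))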

Out[98]:
   col1  col2
0   6.0   6.0
1   3.0   3.0
2   4.0   2.0
3   4.0   2.0
4   4.0   2.0
5   2.0   2.0
6   4.0   2.0
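The same idea can be wrapped in one helper that treats the placeholder strings and NaN alike and copes with a column that has no non-null values at all. This is only a sketch: the name `fill_with_mode` and the default placeholder list are assumptions, and it expects str placeholders (if the data holds bytes such as b'NULL', the list would need to contain bytes instead).

```python
import numpy as np
import pandas as pd

def fill_with_mode(frame, placeholders=('', 'NULL', 'null', '?')):
    """Return a copy where placeholders and NaN become each column's mode."""
    # mask() sets every cell that matches a placeholder to NaN
    cleaned = frame.mask(frame.isin(list(placeholders)))
    # mode() skips NaN, so the placeholders can no longer win
    modes = cleaned.mode()
    if modes.empty:
        # No column has any non-null value, so there is nothing to fill with
        return cleaned
    # dropna() leaves columns without a mode untouched (they stay NaN)
    return cleaned.fillna(modes.iloc[0].dropna())

df = pd.DataFrame({'col1': [6, 3, 'null', 4, 4, 2, '?'],
                   'col2': [6, 3, 2, 'null', '?', 2, 2]})
result = fill_with_mode(df)
print(result)
```

Everything here works on whole columns at once, so there is no Python-level loop over rows, which is usually where the time goes in the original loop.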

Source

2017-10-18 17:19:09 Wen

I appreciate the detailed explanation. Thank you, this really is a great community.
