pandas: reducing the number of levels of a categorical variable

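I am new to pandas and want to do something similar to "Reduce number of levels for large categorical variables", that is, bin a categorical variable so that it has fewer levels. The following code works fine in R: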

DTsetlvls <- function(x, newl) 
    setattr(x, "levels", c(setdiff(levels(x), newl), rep("other", length(newl))))

My data frame:

import pandas as pd

df = pd.DataFrame({'Color': 'Red Red Blue'.split(),
                   'Value': [100, 150, 50]})

# Count how many rows share each Color
df['Counts'] = df.groupby('Color')['Value'].transform('count')
print(df)

  Color  Value  Counts
0   Red    100       2
1   Red    150       2
2  Blue     50       1

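I manually create an aggregate column and then, based on it, label the less frequent groups (for example "Blue") as a single "other" group. Compared with the concise R code, this seems clumsy. What is the right approach here?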
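For reference, the manual approach described above might look like the following sketch. The threshold `min_count` and the column name `Grouped` are assumptions made for illustration; they do not appear in the original question.

```python
import pandas as pd

df = pd.DataFrame({'Color': 'Red Red Blue'.split(),
                   'Value': [100, 150, 50]})

# Helper column: how many rows share each row's Color.
df['Counts'] = df.groupby('Color')['Value'].transform('count')

# Hypothetical threshold: colors seen fewer than min_count times become 'other'.
min_count = 2
df['Grouped'] = df['Color'].where(df['Counts'] >= min_count, 'other')
print(df)
```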

Source

2016-08-23 Georg Heiler

Possible duplicate of [How to group "remaining" results beyond Top N into "Others" with pandas](http://stackoverflow.com/questions/19835746/how-to-group-remaining-results-beyond-top-n-into-others-with-pandas)

I think you can use value_counts with numpy.where, building the condition with isin:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Color': 'Red Red Blue Red Violet Blue'.split(),
                   'Value': [11, 150, 50, 30, 10, 40]})
print(df)
    Color  Value
0     Red     11
1     Red    150
2    Blue     50
3     Red     30
4  Violet     10
5    Blue     40

a = df.Color.value_counts()
print(a)
Red       3
Blue      2
Violet    1
Name: Color, dtype: int64

# get the two most frequent values from the index
vals = a[:2].index
print(vals)
Index(['Red', 'Blue'], dtype='object')
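Slicing the first two entries keeps a fixed number of levels. If the cut-off should instead depend on how often a level occurs, the counts can be filtered directly. The sketch below assumes a hypothetical minimum share of 30% (`min_share`); that threshold is only an illustration and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'Color': 'Red Red Blue Red Violet Blue'.split(),
                   'Value': [11, 150, 50, 30, 10, 40]})

# Relative frequency of each color (fractions that sum to 1),
# sorted from most to least frequent.
shares = df['Color'].value_counts(normalize=True)

# Hypothetical cut-off: keep colors that make up at least 30% of the rows.
min_share = 0.3
vals = shares[shares >= min_share].index
print(list(vals))
```

With this data the result is the same two colors as the top-2 slice, so the rest of the answer works unchanged with either definition of `vals`.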

# flag rows: 0 if Color is among the top values, 1 otherwise
df['new'] = np.where(df.Color.isin(vals), 0, 1)
print(df)
    Color  Value  new
0     Red     11    0
1     Red    150    0
2    Blue     50    0
3     Red     30    0
4  Violet     10    1
5    Blue     40    0

Or, if you need to replace every value that is not among the top ones, use where:

df['new1'] = df.Color.where(df.Color.isin(vals), 'other')
print(df)
    Color  Value   new1
0     Red     11    Red
1     Red    150    Red
2    Blue     50   Blue
3     Red     30    Red
4  Violet     10  other
5    Blue     40   Blue
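To get something closer to a reusable one-liner like the R function in the question, the steps above could be wrapped in a small helper. This is only a sketch: the function name `lump_rare_levels` and its parameters `top_n` and `other` are hypothetical, and the extra branch for categorical columns is an assumption about how one might want to handle the `category` dtype, which has to know the new label before it can store it.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def lump_rare_levels(s, top_n=2, other='other'):
    """Keep the top_n most frequent values of s; replace the rest with other."""
    # value_counts sorts by frequency, most frequent first.
    keep = s.value_counts().index[:top_n].tolist()
    if isinstance(s.dtype, CategoricalDtype):
        # Register the new label as a category before assigning it.
        if other not in s.cat.categories:
            s = s.cat.add_categories([other])
        lumped = s.where(s.isin(keep), other)
        # Drop the levels that no longer occur in the data.
        return lumped.cat.remove_unused_categories()
    return s.where(s.isin(keep), other)


df = pd.DataFrame({'Color': 'Red Red Blue Red Violet Blue'.split(),
                   'Value': [11, 150, 50, 30, 10, 40]})

# Plain object column
df['new1'] = lump_rare_levels(df['Color'], top_n=2)
# Same data stored as a categorical column
df['new2'] = lump_rare_levels(df['Color'].astype('category'), top_n=2)
print(df)
print(df['new2'].cat.categories)
```

If several levels are tied at the cut-off, which of them ends up in the top `top_n` depends on the order returned by value_counts, so a count-based threshold may be the safer choice in that case.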

Source

2016-08-23 11:30:08 jezrael
