2017-07-20 79 views
1

我有一个数据帧my_df,那么我想创建一个新的数据帧new_df。每个new_df列由groupbymy_id创建,然后取my_df中列的max熊猫:为多列数据框实现groupby +聚合的优雅方法?

下面是我的代码,它工作正常。但是,我想知道有没有更好的方法?特别是在未来,我将处理数百列而不是仅仅6列?非常感谢!

tmp_df1 = my_df.groupby(['my_id'], as_index=False).col_A.agg({"max_A": "max"}) 
tmp_df2 = my_df.groupby(['my_id'], as_index=False).col_B.agg({"max_B": "max"}) 
tmp_df3 = my_df.groupby(['my_id'], as_index=False).col_C.agg({"max_C": "max"}) 
tmp_df4 = my_df.groupby(['my_id'], as_index=False).col_D.agg({"max_D": "max"}) 
tmp_df5 = my_df.groupby(['my_id'], as_index=False).col_E.agg({"max_E": "max"}) 
tmp_df6 = my_df.groupby(['my_id'], as_index=False).col_F.agg({"max_F": "max"}) 

combine_df1 = pd.merge(tmp_df1,tmp_df2,how="inner",on=['my_id']) 
combine_df2 = pd.merge(combine_df1,tmp_df3,how="inner",on=['my_id']) 
combine_df3 = pd.merge(combine_df2,tmp_df4,how="inner",on=['my_id']) 
combine_df4 = pd.merge(combine_df3,tmp_df5,how="inner",on=['my_id']) 
new_df = pd.merge(combine_df4,tmp_df6,how="inner",on=['my_id']) 

回答

1

考虑例如数据帧my_df

np.random.seed([3,1415]) 

my_df = pd.DataFrame(dict(
    my_id=list('WXYZ') * 4, 
    col_A=np.random.randint(10, size=16), 
    col_B=np.random.randint(10, size=16), 
    col_C=np.random.randint(10, size=16), 
    col_D=np.random.randint(10, size=16), 
    col_E=np.random.randint(10, size=16), 
    col_F=np.random.randint(10, size=16), 
)) 

my_df 

    col_A col_B col_C col_D col_E col_F my_id 
0  0  3  6  1  9  3  W 
1  2  2  4  5  8  5  X 
2  7  4  7  2  0  8  Y 
3  3  3  6  8  4  5  Z 
4  8  3  2  2  3  1  W 
5  7  6  6  4  0  5  X 
6  0  7  6  7  4  1  Y 
7  6  7  5  6  1  4  Z 
8  8  4  2  9  5  3  W 
9  6  5  8  4  8  9  X 
10  0  3  7  2  6  5  Y 
11  2  7  5  4  0  5  Z 
12  0  5  8  6  8  7  W 
13  4  9  4  3  7  0  X 
14  9  8  7  8  4  3  Y 
15  7  7  6  3  6  2  Z 

你只想用groupbymax

new_df = my_df.groupby('my_id', as_index=False).max() 
new_df 

    my_id col_A col_B col_C col_D col_E col_F 
0  W  8  5  8  9  9  7 
1  X  7  9  8  5  8  9 
2  Y  9  8  7  8  6  8 
3  Z  7  7  6  8  6  5 

如果你想改变列的名称:

my_df.groupby('my_id').max() \ 
    .rename(columns=lambda x: x.replace('col_', '')) \ 
    .add_suffix('_max').reset_index() 

    my_id A_max B_max C_max D_max E_max F_max 
0  W  8  5  8  9  9  7 
1  X  7  9  8  5  8  9 
2  Y  9  8  7  8  6  8 
3  Z  7  7  6  8  6  5 

或者

my_df.groupby('my_id').max() \ 
    .rename(columns=lambda x: x.rsplit('_', 1)[1] + '_max').reset_index() 

    my_id A_max B_max C_max D_max E_max F_max 
0  W  8  5  8  9  9  7 
1  X  7  9  8  5  8  9 
2  Y  9  8  7  8  6  8 
3  Z  7  7  6  8  6  5 
+0

是否有可能在过程中给这些new_df列新的名字呢?即A_max而不是col_A,B_max而不是col_B ...等等?我正在尝试以后再次手动重命名每一列......谢谢! – Edamame

+0

@Edamame我已更新我的帖子。 – piRSquared