2017-04-18 47 views
1

Fill in missing rows after a groupby

I have some SQL data that I am grouping and running some aggregations on. That part works fine:

import numpy

grouped = df.groupby(['a', 'b'])
# Column names must be quoted; bare c and d would raise a NameError
agged = grouped.aggregate({
    'c': [numpy.sum, numpy.mean, numpy.size],
    'd': [numpy.sum, numpy.mean, numpy.size]
})

           c                         d
         sum      mean   size      sum          mean  size
a  b
25 20  107.0  0.804511  133.0  5328000  40060.150376   133
   21  110.0  0.774648  142.0  6031000  42471.830986   142
   23  126.0  0.792453  159.0  8795000  55314.465409   159
   24   72.0  0.947368   76.0  2920000  38421.052632    76
   25   54.0  0.818182   66.0  2570000  38939.393939    66
26 23  126.0  0.792453  159.0  8795000  55314.465409   159

But I want to fill in, with zeros, every row that exists for a=25 but not for a=26. In other words, something like this:

           c                         d
         sum      mean   size      sum          mean  size
a  b
25 20  107.0  0.804511  133.0  5328000  40060.150376   133
   21  110.0  0.774648  142.0  6031000  42471.830986   142
   23  126.0  0.792453  159.0  8795000  55314.465409   159
   24   72.0  0.947368   76.0  2920000  38421.052632    76
   25   54.0  0.818182   66.0  2570000  38939.393939    66
26 20      0         0      0        0             0     0
   21      0         0      0        0             0     0
   23  126.0  0.792453  159.0  8795000  55314.465409   159
   24      0         0      0        0             0     0
   25      0         0      0        0             0     0

How can I do this?

+1

Your output does not match what you asked for. 'a == 25' would be the entire first block. Why do you want to zero out rows in the 'a == 6' group? – piRSquared

+0

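I may not have explained it very clearly. I basically want to fill in any missing "rows" with 0 after the grouping is done, so that the data is more "complete" when it is used elsewhere. –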

+0

Possible duplicate of [Pandas category sub-group 0 counts](http://stackoverflow.com/questions/43097140/pandas-category-sub-group-0-counts) – gereleth

Answers

2

Consider the dataframe df:

import numpy as np
import pandas as pd

# Random integers, so the values will differ from run to run
df = pd.DataFrame(
    np.random.randint(10, size=(6, 6)),
    pd.MultiIndex.from_tuples(
        [(25, 20), (25, 21), (25, 23), (25, 24), (25, 25), (26, 23)],
        names=['a', 'b']
    ),
    pd.MultiIndex.from_product(
        [['c', 'd'], ['sum', 'mean', 'size']]
    )
)

        c             d
      sum mean size sum mean size
a  b
25 20   8    3    5   5    0    2
   21   3    7    8   9    2    7
   23   2    1    3   2    5    4
   24   9    0    1   7    1    6
   25   1    9    3   5    8    8
26 23   8    8    4   8    0    5

You can quickly recover all the missing rows from the Cartesian product by calling unstack(fill_value=0), followed by stack:

df.unstack(fill_value=0).stack() 

         c             d
      mean size sum mean size sum
a  b
25 20    3    5   8    0    2   5
   21    7    8   3    2    7   9
   23    1    3   2    5    4   2
   24    0    1   9    1    6   7
   25    9    3   1    8    8   5
26 20    0    0   0    0    0   0
   21    0    0   0    0    0   0
   23    8    4   8    0    5   8
   24    0    0   0    0    0   0
   25    0    0   0    0    0   0

Note: using fill_value=0 preserves the int dtype. Without it, unstacking fills the gaps with NaN and the dtypes are converted to float.
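
A minimal sketch of the dtype point in the note, using a small hypothetical frame (the values and the single-level columns c and d are made up for illustration and are not the asker's data):

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Hypothetical integer data; the (26, 20) row is deliberately missing.
idx = pd.MultiIndex.from_tuples(
    [(25, 20), (25, 21), (26, 21)], names=["a", "b"]
)
small = pd.DataFrame({"c": [1, 2, 3], "d": [4, 5, 6]}, index=idx)

# With fill_value=0 the gap is filled directly, so the columns stay integer.
filled = small.unstack(fill_value=0)
all_int = all(is_integer_dtype(t) for t in filled.dtypes)

# Without it the gap becomes NaN, which forces a float dtype.
unfilled = small.unstack()
any_float = any(is_float_dtype(t) for t in unfilled.dtypes)

# Stacking b back gives one row per (a, b) combination.
restored = filled.stack()

print(all_int, any_float)
print(restored)
```

Depending on the pandas version, stack() may emit a warning about its newer implementation; the filled rows should be the same either way.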

1

print(df)

           c                         d
         sum      mean   size      sum          mean  size
a  b
25 20  107.0  0.804511  133.0  5328000  40060.150376   133
   21  110.0  0.774648  142.0  6031000  42471.830986   142
   23  126.0  0.792453  159.0  8795000  55314.465409   159
   24   72.0  0.947368   76.0  2920000  38421.052632    76
   25   54.0  0.818182   66.0  2570000  38939.393939    66
26 23  126.0  0.792453  159.0  8795000  55314.465409   159

I would use:

df = df.unstack().replace(np.nan,0).stack(-1) 
print(df) 
              c                           d
           mean   size    sum          mean   size        sum
a  b
25 20  0.804511  133.0  107.0  40060.150376  133.0  5328000.0
   21  0.774648  142.0  110.0  42471.830986  142.0  6031000.0
   23  0.792453  159.0  126.0  55314.465409  159.0  8795000.0
   24  0.947368   76.0   72.0  38421.052632   76.0  2920000.0
   25  0.818182   66.0   54.0  38939.393939   66.0  2570000.0
26 20  0.000000    0.0    0.0      0.000000    0.0        0.0
   21  0.000000    0.0    0.0      0.000000    0.0        0.0
   23  0.792453  159.0  126.0  55314.465409  159.0  8795000.0
   24  0.000000    0.0    0.0      0.000000    0.0        0.0
   25  0.000000    0.0    0.0      0.000000    0.0        0.0
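
For completeness, a rough end-to-end sketch that starts from ungrouped rows and combines the groupby from the question with the unstack/stack round trip. The raw frame is hypothetical stand-in data (the real SQL data is not shown in the question), and passing the aggregations as the strings 'sum', 'mean' and 'size' instead of numpy functions is my assumption about an equivalent spelling:

```python
import pandas as pd

# Hypothetical raw rows standing in for the SQL data.
# The combination a=26, b=20 never occurs.
raw = pd.DataFrame({
    "a": [25, 25, 25, 26],
    "b": [20, 20, 21, 21],
    "c": [1.0, 0.0, 1.0, 1.0],
    "d": [100, 200, 300, 400],
})

# Same shape of aggregation as in the question.
agged = raw.groupby(["a", "b"]).agg({
    "c": ["sum", "mean", "size"],
    "d": ["sum", "mean", "size"],
})

# Move b into the columns (filling gaps with 0), then move it back to the
# index so that every (a, b) combination gets a row.
full = agged.unstack("b", fill_value=0).stack("b")

print(full)
```

Note that a zero "mean" for a group with no rows is a convention chosen here, not a computed average, so it may be worth keeping the "size" column around to tell filled rows from real ones.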