2016-11-11 42 views
1

Pandas - get a summarized pivot DataFrame of counts

Given this DataFrame:

bowl cookie 
0 one  chocolate 
1 two  chocolate 
2 two  chocolate 
3 two  vanilla 
4 one  vanilla 
5 one  vanilla 
6 one  vanilla 
7 one  vanilla 
8 one  vanilla 
9 two  chocolate 

I would like to obtain the following summary DataFrame:

 vanilla  chocolate 
one  5   1 
two  1   3 

Other than computing each count manually:

vanilla_bowl1 = len(df_picks[(df_picks['bowl'] == 'one') & (df_picks['cookie'] == 'vanilla')]) 
vanilla_bowl2 = len(df_picks[(df_picks['bowl'] == 'two') & (df_picks['cookie'] == 'vanilla')]) 
chocolate_bowl1 = ... 
chocolate_bowl2 = ... 

Is there a way to do this in a single Pandas operation?


Note: I looked at df.pivot(), which would work provided I first add a count column equal to 1 on every row:

bowl cookie  count 
0 one  chocolate  1 
1 two  chocolate  1 
2 two  chocolate  1 
3 two  vanilla   1 
4 one  vanilla   1 
5 one  vanilla   1 
6 one  vanilla   1 
7 one  vanilla   1 
8 one  vanilla   1 
9 two  chocolate  1 

Then:

df.pivot(index='bowl', columns='cookie', values='count') 

However, I am wondering whether there is a more direct approach that does not require adding the count column in the first place.
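A caveat on the df.pivot() idea, as a hedged sketch rather than part of the original question: when the same (bowl, cookie) pair occurs on several rows, as it does in this data, DataFrame.pivot raises a ValueError because it cannot reshape duplicate entries, whereas pivot_table with an explicit aggregation accepts them. The construction of df below is an assumption that mirrors the sample data.

```python
import pandas as pd

# Assumed reconstruction of the sample data from the question.
df = pd.DataFrame({
    "bowl": ["one", "two", "two", "two", "one",
             "one", "one", "one", "one", "two"],
    "cookie": ["chocolate", "chocolate", "chocolate", "vanilla", "vanilla",
               "vanilla", "vanilla", "vanilla", "vanilla", "chocolate"],
})
df["count"] = 1

# pivot() needs each (index, columns) pair to be unique, so it fails here.
try:
    df.pivot(index="bowl", columns="cookie", values="count")
    pivot_error = None
except ValueError as exc:
    pivot_error = exc

# pivot_table() aggregates the duplicates instead of rejecting them.
summed = df.pivot_table(index="bowl", columns="cookie",
                        values="count", aggfunc="sum")
print(pivot_error)
print(summed)
```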

Answers

3

The most concise way is probably the pandas.crosstab function:

>>> pandas.crosstab(d.bowl, d.cookie) 
cookie chocolate vanilla 
bowl      
one    1  5 
two    3  1 
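For readers who want to run this end to end, here is a minimal self-contained sketch. The DataFrame construction is an assumption that mirrors the sample data in the question, and the final column reordering is only there to match the column order the question asks for.

```python
import pandas as pd

# Assumed reconstruction of the sample data from the question.
df = pd.DataFrame({
    "bowl": ["one", "two", "two", "two", "one",
             "one", "one", "one", "one", "two"],
    "cookie": ["chocolate", "chocolate", "chocolate", "vanilla", "vanilla",
               "vanilla", "vanilla", "vanilla", "vanilla", "chocolate"],
})

# crosstab counts how often each (bowl, cookie) pair occurs.
counts = pd.crosstab(df["bowl"], df["cookie"])

# Optional: put the columns in the order shown in the question.
counts = counts[["vanilla", "chocolate"]]
print(counts)
```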
+0

Indeed, but it is a bit slower than the pivot_table and 'groupby([...]).aggfunc().unstack()' solutions – MaxU

2

You can use the pivot_table() method:

In [33]: df.pivot_table(index='bowl', columns='cookie', aggfunc='size', fill_value=0) 
Out[33]: 
cookie chocolate vanilla 
bowl 
one    1  5 
two    3  1 

Or you can chain groupby(), size() and unstack(), which is what pivot_table() does under the hood:

In [36]: df.groupby(['bowl', 'cookie']).size().unstack('cookie', fill_value=0) 
Out[36]: 
cookie chocolate vanilla 
bowl 
one    1  5 
two    3  1 
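The groupby chain can be easier to follow when split into its two steps. The sketch below is illustrative only; the DataFrame construction is an assumption that mirrors the sample data, and the intermediate variable names are hypothetical.

```python
import pandas as pd

# Assumed reconstruction of the sample data from the question.
df = pd.DataFrame({
    "bowl": ["one", "two", "two", "two", "one",
             "one", "one", "one", "one", "two"],
    "cookie": ["chocolate", "chocolate", "chocolate", "vanilla", "vanilla",
               "vanilla", "vanilla", "vanilla", "vanilla", "chocolate"],
})

# Step 1: one count per (bowl, cookie) pair, as a Series with a MultiIndex.
pair_sizes = df.groupby(["bowl", "cookie"]).size()

# Step 2: move the 'cookie' level into the columns; pairs that never
# occur would become NaN, so fill_value=0 keeps the result as counts.
by_group = pair_sizes.unstack("cookie", fill_value=0)
print(pair_sizes)
print(by_group)
```

On this data the result holds the same counts as pd.crosstab(df.bowl, df.cookie).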

Timing for a 100K-row DF:

In [48]: big = pd.concat([df] * 10**4, ignore_index=True) 

In [49]: big.shape 
Out[49]: (100000, 2) 

In [50]: %timeit pd.crosstab(big.bowl, big.cookie) 
10 loops, best of 3: 58 ms per loop 

In [51]: %timeit big.pivot_table(index='bowl', columns='cookie', aggfunc='size', fill_value=0) 
10 loops, best of 3: 38.4 ms per loop 

In [52]: %timeit big.groupby(['bowl', 'cookie']).size().unstack('cookie', fill_value=0) 
10 loops, best of 3: 34.2 ms per loop 

In [118]: %timeit pir(big) 
1 loop, best of 3: 631 ms per loop 

In [119]: big.shape 
Out[119]: (100000, 2) 

Timing for a 1M-row DF:

In [53]: big = pd.concat([big] * 10, ignore_index=True) 

In [54]: big.shape 
Out[54]: (1000000, 2) 

In [55]: %timeit pd.crosstab(big.bowl, big.cookie) 
1 loop, best of 3: 446 ms per loop 

In [56]: %timeit big.pivot_table(index='bowl', columns='cookie', aggfunc='size', fill_value=0) 
1 loop, best of 3: 333 ms per loop 

In [57]: %timeit big.groupby(['bowl', 'cookie']).size().unstack('cookie', fill_value=0) 
1 loop, best of 3: 327 ms per loop 

In [121]: %timeit pir(big) 
1 loop, best of 3: 7.08 s per loop 

In [122]: big.shape 
Out[122]: (1000000, 2) 
+1

Can you add my method to your timings? Thanks – piRSquared

+0

@piRSquared, I've added timings for the 'pir()' function – MaxU

+0

Wow! I missed the mark on that one. I'll have to find another way. Thx – piRSquared

1

A NumPy approach:

from itertools import product 
import pandas as pd 
import numpy as np 

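def pir(df):
    # Assumes df has exactly two columns, in the order bowl, cookie.
    vals = df.values
    # Sorted unique labels become the row and column indexes of the result.
    ub = pd.Index(np.unique(vals[:, 0]), name='bowl')
    uc = pd.Index(np.unique(vals[:, 1]), name='cookie')
    # Every (bowl, cookie) combination, in bowl-major order.
    u = np.array(list(product(ub.values, uc.values)))
    # Compare each combination with every row: shape (combos, rows, 2).
    e = u[:, None] == vals

    # A row matches when both fields agree; count the matches per combination.
    return pd.DataFrame(
        e.all(2).sum(1).reshape(len(ub), len(uc)),
        ub, uc
    )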

pir(df) 
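As a hedged aside, and not something proposed in the thread: another NumPy-style route is to encode each label as an integer with pd.factorize and count the combined codes with np.bincount, which avoids building the combinations-by-rows comparison array. The function name count_pairs and its parameters are hypothetical, and the sketch assumes neither column contains missing values (factorize would code those as -1, which bincount rejects).

```python
import numpy as np
import pandas as pd

def count_pairs(df, row_col="bowl", col_col="cookie"):
    # Integer codes per label; sort=True keeps the labels in sorted order.
    row_codes, row_labels = pd.factorize(df[row_col], sort=True)
    col_codes, col_labels = pd.factorize(df[col_col], sort=True)
    n_rows, n_cols = len(row_labels), len(col_labels)

    # Map each (row, col) pair to a single cell number, then count cells.
    flat = row_codes * n_cols + col_codes
    counts = np.bincount(flat, minlength=n_rows * n_cols)

    return pd.DataFrame(
        counts.reshape(n_rows, n_cols),
        index=pd.Index(row_labels, name=row_col),
        columns=pd.Index(col_labels, name=col_col),
    )

# Assumed reconstruction of the sample data from the question.
df = pd.DataFrame({
    "bowl": ["one", "two", "two", "two", "one",
             "one", "one", "one", "one", "two"],
    "cookie": ["chocolate", "chocolate", "chocolate", "vanilla", "vanilla",
               "vanilla", "vanilla", "vanilla", "vanilla", "chocolate"],
})
result = count_pairs(df)
print(result)
```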
