2017-06-02

I have a dataframe and I need to combine two different groupbys, with one of them filtered (pandas: combine two groupbys, filter, and merge the group counts).

ID  EVENT  SUCCESS 
1  PUT   Y 
2  POST   Y 
2  PUT   N 
1  DELETE  Y 
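For reference, the sample frame above can be reproduced with (row order assumed from the listing):

```python
import pandas as pd

# Rebuild the question's sample data
df = pd.DataFrame({
    'ID': [1, 2, 2, 1],
    'EVENT': ['PUT', 'POST', 'PUT', 'DELETE'],
    'SUCCESS': ['Y', 'Y', 'N', 'Y'],
})
print(df)
```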

The table below is the data I want: first, the counts grouped by 'EVENT', and second, the number of successes ('Y') per ID.

ID PUT POST DELETE SUCCESS 
1 1  0  1  2 
2 1  1  0  1 

I have tried a few techniques, and the closest I got were two different approaches, which produce the following:

group_df = df.groupby(['ID', 'EVENT'])
count_group_df = group_df.size().unstack()

which yields the following counts per 'EVENT':

ID PUT POST DELETE 
1 1  0  1  
2 1  1  0  

For the SUCCESS filter, I don't know whether I can join this with the first set on 'ID':

df_success = df.loc[df['SUCCESS'] == 'Y', ['ID', 'SUCCESS']] 
count_group_df_2 = df_success.groupby(['ID', 'SUCCESS']) 


ID SUCCESS 
1  2 
2  1 
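As a side note, the grouped object in the attempt above still needs an aggregation to yield the counts shown; a minimal sketch, rebuilding the sample df so it runs standalone:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 2, 1],
    'EVENT': ['PUT', 'POST', 'PUT', 'DELETE'],
    'SUCCESS': ['Y', 'Y', 'N', 'Y'],
})
# keep only the successful rows, then count them per ID
df_success = df.loc[df['SUCCESS'] == 'Y', ['ID', 'SUCCESS']]
counts = df_success.groupby('ID').size()
print(counts)  # ID 1 -> 2, ID 2 -> 1
```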

Do I need to combine these somehow?

Additionally, I would also like to merge the counts of two 'EVENT' values, e.g. PUT and POST, into a single column.
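For that follow-up, one possible approach is to map both labels onto a shared name before counting; 'WRITE' below is just an illustrative combined label, not from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 2, 1],
    'EVENT': ['PUT', 'POST', 'PUT', 'DELETE'],
    'SUCCESS': ['Y', 'Y', 'N', 'Y'],
})
# Map PUT and POST onto one illustrative label before grouping
event = df['EVENT'].replace({'PUT': 'WRITE', 'POST': 'WRITE'})
combined = df.groupby(['ID', event]).size().unstack(fill_value=0)
print(combined)
```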

Answers


Use concat to merge them together:

df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = (df['SUCCESS'] == 'Y').groupby(df['ID']).sum().astype(int)
df = pd.concat([df1, df_success], axis=1)
print(df)
    DELETE  POST  PUT  SUCCESS
ID
1        1     0    1        2
2        0     1    1        1

Another solution with value_counts:

df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0)
df_success = df.loc[df['SUCCESS'] == 'Y', 'ID'].value_counts().rename('SUCCESS')
df = pd.concat([df1, df_success], axis=1)
print(df)
    DELETE  POST  PUT  SUCCESS
ID
1        1     0    1        2
2        0     1    1        1

Finally, you may want to convert the index back to a column and drop the columns' axis name via reset_index + rename_axis:

df = df.reset_index().rename_axis(None, axis=1)
print(df)
   ID  DELETE  POST  PUT  SUCCESS
0   1       1     0    1        2
1   2       0     1    1        1
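A related alternative, assuming the same df as in the question, is pd.crosstab for the event counts plus the same boolean-sum for the successes:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 2, 1],
    'EVENT': ['PUT', 'POST', 'PUT', 'DELETE'],
    'SUCCESS': ['Y', 'Y', 'N', 'Y'],
})
out = pd.crosstab(df['ID'], df['EVENT'])           # ID x EVENT counts
out['SUCCESS'] = df['SUCCESS'].eq('Y').groupby(df['ID']).sum()
out = out.reset_index().rename_axis(None, axis=1)  # ID back to a column
print(out)
```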

pandas

pd.get_dummies(df.EVENT) \ 
    .assign(SUCCESS=df.SUCCESS.eq('Y').astype(int)) \ 
    .groupby(df.ID).sum().reset_index() 

   ID  DELETE  POST  PUT  SUCCESS
0   1       1     0    1        2
1   2       0     1    1        1

numpy + pandas

f, u = pd.factorize(df.EVENT.values) 
n = u.size 
d = np.eye(n)[f] 
s = (df.SUCCESS.values == 'Y').astype(int) 
d1 = pd.DataFrame(
    np.column_stack([d, s]), 
    df.index, np.append(u, 'SUCCESS') 
) 
d1.groupby(df.ID).sum().reset_index() 

   ID  DELETE  POST  PUT  SUCCESS
0   1       1     0    1        2
1   2       0     1    1        1
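The np.eye(n)[f] step above is a manual one-hot encoding; on the question's data it matches get_dummies up to column order, which this small sketch checks:

```python
import numpy as np
import pandas as pd

s = pd.Series(['PUT', 'POST', 'PUT', 'DELETE'])
f, u = pd.factorize(s.values)              # integer codes + unique labels
onehot = np.eye(u.size)[f]                 # select one-hot rows by code
dummies = pd.get_dummies(s)[u].to_numpy()  # reorder to factorize's label order
print(onehot.shape)
```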

Timings
Small data

%%timeit 
f, u = pd.factorize(df.EVENT.values) 
n = u.size 
d = np.eye(n)[f] 
s = (df.SUCCESS.values == 'Y').astype(int) 
d1 = pd.DataFrame(
    np.column_stack([d, s]), 
    df.index, np.append(u, 'SUCCESS') 
) 
d1.groupby(df.ID).sum().reset_index() 
1000 loops, best of 3: 1.32 ms per loop 

%%timeit 
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0) 
df_success = (df['SUCCESS'] == 'Y').groupby(df['ID']).sum().astype(int) 
pd.concat([df1, df_success],axis=1).reset_index() 
100 loops, best of 3: 3.3 ms per loop 

%%timeit 
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0) 
df_success = df.loc[df['SUCCESS'] == 'Y', 'ID'].value_counts().rename('SUCCESS') 
pd.concat([df1, df_success],axis=1).reset_index() 
100 loops, best of 3: 3.28 ms per loop 

%timeit pd.get_dummies(df.EVENT).assign(SUCCESS=df.SUCCESS.eq('Y').astype(int)).groupby(df.ID).sum().reset_index() 
100 loops, best of 3: 2.62 ms per loop 

Large data

df = pd.DataFrame(dict(
     ID=np.random.randint(100, size=100000), 
     EVENT=np.random.choice('PUT POST DELETE'.split(), size=100000), 
     SUCCESS=np.random.choice(list('YN'), size=100000) 
    )) 

%%timeit 
f, u = pd.factorize(df.EVENT.values) 
n = u.size 
d = np.eye(n)[f] 
s = (df.SUCCESS.values == 'Y').astype(int) 
d1 = pd.DataFrame(
    np.column_stack([d, s]), 
    df.index, np.append(u, 'SUCCESS') 
) 
d1.groupby(df.ID).sum().reset_index() 
100 loops, best of 3: 10.8 ms per loop 

%%timeit 
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0) 
df_success = (df['SUCCESS'] == 'Y').groupby(df['ID']).sum().astype(int) 
pd.concat([df1, df_success],axis=1).reset_index() 
100 loops, best of 3: 17.7 ms per loop 

%%timeit 
df1 = df.groupby(['ID', 'EVENT']).size().unstack(fill_value=0) 
df_success = df.loc[df['SUCCESS'] == 'Y', 'ID'].value_counts().rename('SUCCESS') 
pd.concat([df1, df_success],axis=1).reset_index() 
100 loops, best of 3: 17.4 ms per loop 

%timeit pd.get_dummies(df.EVENT).assign(SUCCESS=df.SUCCESS.eq('Y').astype(int)).groupby(df.ID).sum().reset_index() 
100 loops, best of 3: 16.8 ms per loop