2017-05-31

Pandas or Python equivalent of tidyr complete

I have data that looks like this:

library("tidyverse") 

df <- tibble(user = c(1, 1, 2, 3, 3, 3), x = c("a", "b", "a", "a", "c", "d"), y = 1) 
df 

#   user x y
# 1    1 a 1
# 2    1 b 1
# 3    2 a 1
# 4    3 a 1
# 5    3 c 1
# 6    3 d 1

In Python format:

import pandas as pd 
df = pd.DataFrame({'user':[1, 1, 2, 3, 3, 3], 'x':['a', 'b', 'a', 'a', 'c', 'd'], 'y':1}) 

I'd like to "complete" the data frame so that every user has a record for every possible x, with missing y values filled with 0.

This is fairly trivial in R (tidyverse/tidyr):

df %>% 
    complete(nesting(user), x = c("a", "b", "c", "d"), fill = list(y = 0)) 

#    user x y
# 1     1 a 1
# 2     1 b 1
# 3     1 c 0
# 4     1 d 0
# 5     2 a 1
# 6     2 b 0
# 7     2 c 0
# 8     2 d 0
# 9     3 a 1
# 10    3 b 0
# 11    3 c 1
# 12    3 d 1

Is there an equivalent of complete in pandas/Python that will yield the same result?

Answer

You can use reindex with MultiIndex.from_product:

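# Move the key columns into a MultiIndex
df = df.set_index(['user','x'])
# Build every (user, x) combination from the values present in each index level
mux = pd.MultiIndex.from_product([df.index.levels[0], df.index.levels[1]], names=['user','x'])
# Add the missing combinations, filling y with 0
df = df.reindex(mux, fill_value=0).reset_index()
print(df)
    user  x  y
0      1  a  1
1      1  b  1
2      1  c  0
3      1  d  0
4      2  a  1
5      2  b  0
6      2  c  0
7      2  d  0
8      3  a  1
9      3  b  0
10     3  c  1
11     3  d  1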
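
One difference from the R call is worth noting: `df.index.levels` only contains values that actually occur in the data, whereas `complete` above is given the `x` values explicitly. The following is a minimal sketch, assuming you want to pass the full list of `x` values yourself. The helper name `complete_user_x` and its parameters are hypothetical and not part of pandas.

```python
import pandas as pd

def complete_user_x(df, x_values, fill_value=0):
    """Hypothetical helper: return one row per (user, x) pair."""
    # Every combination of the observed users and the explicitly given x values.
    full_index = pd.MultiIndex.from_product(
        [df["user"].unique(), x_values], names=["user", "x"]
    )
    # reindex adds the missing pairs; fill_value fills the newly created cells.
    return (
        df.set_index(["user", "x"])
          .reindex(full_index, fill_value=fill_value)
          .reset_index()
    )

df = pd.DataFrame({"user": [1, 1, 2, 3, 3, 3],
                   "x": ["a", "b", "a", "a", "c", "d"],
                   "y": 1})

# "e" never occurs in the data but still gets a row for every user.
completed = complete_user_x(df, x_values=["a", "b", "c", "d", "e"])
```

This sketch relies on the (user, x) pairs being unique, since `reindex` raises an error when the index contains duplicate labels.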

Or set_index + unstack + stack:

# Pivot x into columns (filling gaps with 0), then stack back into long form
df = df.set_index(['user','x'])['y'].unstack(fill_value=0).stack().reset_index(name='y')
print(df)
    user  x  y
0      1  a  1
1      1  b  1
2      1  c  0
3      1  d  0
4      2  a  1
5      2  b  0
6      2  c  0
7      2  d  0
8      3  a  1
9      3  b  0
10     3  c  1
11     3  d  1
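
The unstack/stack route works on a single value column and needs unique (user, x) pairs. If the frame carries several columns that need different fill values, a merge-based variant may be easier to adapt. The following is a sketch only, assuming pandas 1.2 or later (for `how="cross"`); the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"user": [1, 1, 2, 3, 3, 3],
                   "x": ["a", "b", "a", "a", "c", "d"],
                   "y": 1})

# All (user, x) combinations via a cross join (assumes pandas >= 1.2).
users = df[["user"]].drop_duplicates()
xs = df[["x"]].drop_duplicates()
grid = users.merge(xs, how="cross")

# Left-join the observed rows onto the grid, then fill only the y column.
completed = (
    grid.merge(df, on=["user", "x"], how="left")
        .fillna({"y": 0})            # per-column fill, similar to fill = list(y = 0)
        .astype({"y": "int64"})      # the join introduces NaN, which makes y float
        .sort_values(["user", "x"])
        .reset_index(drop=True)
)
```

Passing a dict to `fillna` keeps the fill specific to `y`, so any other columns would keep their missing values unless listed too.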

You crushed this in 3 minutes? Unreal – emehex