2016-09-18 70 views

Remove values from a DataFrame based on a groupby condition

I am trying to work out how to slice my DataFrame.

import pandas as pd

data = {'Date': ['08/20/10', '08/20/10', '08/20/10', '08/21/10', '08/22/10', '08/24/10', '08/25/10', '08/26/10'],
        'Receipt': [10001, 10001, 10002, 10002, 10003, 10004, 10004, 10004],
        'Product': ['xx1', 'xx2', 'yy1', 'fff4', 'gggg4', 'fsf4', 'gggh5', 'hhhg6']}

dfTest = pd.DataFrame(data)
dfTest

This produces:

Date Product Receipt 
0 08/20/10 xx1 10001 
1 08/20/10 xx2 10001 
2 08/20/10 yy1 10002 
3 08/21/10 fff4 10002 
4 08/22/10 gggg4 10003 
5 08/24/10 fsf4 10004 
6 08/25/10 gggh5 10004 
7 08/26/10 hhhg6 10004 

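I want to create a new DataFrame that contains only unique receipts, meaning a receipt should be used on only one day (although it may appear several times within that day). If a receipt appears on more than one day, it needs to be removed. The dataset above should end up looking like this: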

Date Product Receipt 
0 08/20/10 xx1 10001 
1 08/20/10 xx2 10001 
2 08/22/10 gggg4 10003 

What I have done so far is:

dfTest.groupby(['Receipt','Date']).count() 

                  Product
Receipt Date
10001   08/20/10        2
10002   08/20/10        1
        08/21/10        1
10003   08/22/10        1
10004   08/24/10        1
        08/25/10        1
        08/26/10        1

I don't know how to query on the date in this structure, so I reset the index.

df1 = dfTest.groupby(['Receipt','Date']).count().reset_index() 


Receipt Date Product 
0 10001 08/20/10 2 
1 10002 08/20/10 1 
2 10002 08/21/10 1 
3 10003 08/22/10 1 
4 10004 08/24/10 1 
5 10004 08/25/10 1 
6 10004 08/26/10 1 

Now I don't know how to proceed. I hope someone can lend a hand. This is probably easy; I am just a bit confused or lack the experience.

Answers


You can use SeriesGroupBy.nunique together with boolean indexing, where the condition is built with Series.isin:

df1 = dfTest.groupby(['Receipt'])['Date'].nunique() 
print (df1) 
Receipt 
10001 1 
10002 2 
10003 1 
10004 3 
Name: Date, dtype: int64 

# get the Receipt values whose number of unique dates is 1
print (df1[df1 == 1].index) 
Int64Index([10001, 10003], dtype='int64', name='Receipt') 

# keep all rows whose Receipt is one of those values
print (dfTest[dfTest['Receipt'].isin(df1[df1 == 1].index)]) 
     Date Product Receipt 
0 08/20/10  xx1 10001 
1 08/20/10  xx2 10001 
4 08/22/10 gggg4 10003 
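
Note that the result keeps the original row labels (0, 1, 4), whereas the desired output in the question is numbered 0, 1, 2. A minimal self-contained sketch of the same nunique/isin approach with the index renumbered afterwards follows; the names days_per_receipt, single_day and result are only illustrative, and the column order printed may differ between pandas versions.

```python
import pandas as pd

data = {
    'Date': ['08/20/10', '08/20/10', '08/20/10', '08/21/10',
             '08/22/10', '08/24/10', '08/25/10', '08/26/10'],
    'Receipt': [10001, 10001, 10002, 10002, 10003, 10004, 10004, 10004],
    'Product': ['xx1', 'xx2', 'yy1', 'fff4', 'gggg4', 'fsf4', 'gggh5', 'hhhg6'],
}
dfTest = pd.DataFrame(data)

# Count the distinct dates on which each receipt occurs.
days_per_receipt = dfTest.groupby('Receipt')['Date'].nunique()

# Receipt numbers that occur on exactly one day (illustrative name).
single_day = days_per_receipt[days_per_receipt == 1].index

# Keep only rows for those receipts, then renumber the rows from 0.
result = dfTest[dfTest['Receipt'].isin(single_day)].reset_index(drop=True)
print(result)
```

reset_index(drop=True) discards the old labels instead of adding them as a column; leave it out if the original row labels are still needed.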

Another solution is to find the index of the rows that satisfy the condition and then select them from the DataFrame with loc:

print (dfTest.groupby(['Receipt']).filter(lambda x: x.Date.nunique()==1).index) 
Int64Index([0, 1, 4], dtype='int64') 


df1 = dfTest.loc[dfTest.groupby(['Receipt']).filter(lambda x: x.Date.nunique()==1).index] 
print (df1) 
     Date Product Receipt 
0 08/20/10  xx1 10001 
1 08/20/10  xx2 10001 
4 08/22/10 gggg4 10003
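
As a possible third variant, the per-receipt count can be broadcast back onto every row with transform, which gives a boolean mask directly and avoids the lambda. This is only a sketch, not part of the answer above; n_days, df_single and df_filtered are illustrative names. It also shows that filter already returns the matching rows, so the extra loc lookup is optional.

```python
import pandas as pd

data = {
    'Date': ['08/20/10', '08/20/10', '08/20/10', '08/21/10',
             '08/22/10', '08/24/10', '08/25/10', '08/26/10'],
    'Receipt': [10001, 10001, 10002, 10002, 10003, 10004, 10004, 10004],
    'Product': ['xx1', 'xx2', 'yy1', 'fff4', 'gggg4', 'fsf4', 'gggh5', 'hhhg6'],
}
dfTest = pd.DataFrame(data)

# For every row, the number of distinct dates of that row's receipt.
n_days = dfTest.groupby('Receipt')['Date'].transform('nunique')

# Boolean indexing: keep rows whose receipt occurs on a single day.
df_single = dfTest[n_days == 1]
print(df_single)

# filter returns the kept rows themselves, with their original labels.
df_filtered = dfTest.groupby('Receipt').filter(lambda g: g['Date'].nunique() == 1)
print(df_filtered.equals(df_single))
```

On larger data the transform version is usually the faster of the two, because filter calls the Python lambda once per group.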