2015-07-22 139 views
0

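PANDAS: merging calculated data from a dataframe into the main dataframe

First-time poster here, so apologies if I haven't asked this question quite right. I have spent many years manipulating data in Excel and PowerPivot, but my current project requires more heavy lifting. I have been looking at pandas and think it is up to the job, but I am stuck.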

I am trying to calculate the number of days since the previous purchase for each customer.

My initial dataframe looks like this:

  customer_id        date  invoice_amt
0        101A  21/03/2012       654.76
1        101A   1/02/2012       234.45
2        102A  23/01/2012        99.45
3        104B  18/12/2011       767.63
4        101A   9/12/2011       124.76
5        104B  27/11/2011       346.87
6        102A  18/11/2011       652.65
7        104B  12/10/2011       765.21
8        101A   1/10/2011       275.76
9        102A  21/09/2011       532.21

My target dataframe looks like this:

  customer_id        date  invoice_amt  days_since
0        101A  21/03/2012       654.76          49
1        101A   1/02/2012       234.45          54
2        102A  23/01/2012        99.45          66
3        104B  18/12/2011       767.63          21
4        101A   9/12/2011       124.76          69
5        104B  27/11/2011       346.87          46
6        102A  18/11/2011       652.65          58
7        104B  12/10/2011       765.21         NaN
8        101A   1/10/2011       275.76         NaN
9        102A  21/09/2011       532.21         NaN

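I have got as far as calculating the days_since value within each grouped dataframe, but I don't know how to get those values back into the main dataframe (data_df).

Any help would be greatly appreciated... thanks.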

import pandas as pd 
#import numpy as np 

#dataframe data note: no_days_since_last_purchase hard coded for testing purposes 
my_data = {'customer_id' : ['101A', '101A', '102A', '104B', '101A', '104B', '102A', '104B', '101A', '102A' ], 
      'date' : ['20120321','20120201','20120123','20111218','20111209','20111127','20111118','20111012','20111001','20110921'], 
      'invoice_amt' : [654.76, 234.45, 99.45, 767.63, 124.76, 346.87, 652.65, 765.21, 275.76, 532.21 ], 
      'no_days_since_last_purchase' : ['49', '54', '66', '21', '69', '46', '58', 'NaN', 'NaN', 'NaN']} 

data_df = pd.DataFrame(my_data).sort_values(by='date', ascending=True)

#convert date str to date type 
data_df['date'] = pd.to_datetime(data_df['date'].astype(str),format='%Y%m%d') 

#group dataframe by customer_id 
grouped_data = data_df.groupby(['customer_id'])  

#for each row in each grouped dataframe calculate the difference in days between current and previous
#if there is no previous then use 2000-01-01, then take the whole number of days
for customer_id, group in grouped_data:
    group['days_since'] = (group['date'] - group['date'].shift().fillna(pd.Timestamp(2000, 1, 1))).dt.days
    print(group)

OUTPUT:

customer_id  date invoice_amt no_days_since_last_purchase days_since 
8  101A 2011-10-01  275.76       NaN  4291 
4  101A 2011-12-09  124.76       69   69 
1  101A 2012-02-01  234.45       54   54 
0  101A 2012-03-21  654.76       49   49 
    customer_id  date invoice_amt no_days_since_last_purchase days_since 
9  102A 2011-09-21  532.21       NaN  4281 
6  102A 2011-11-18  652.65       58   58 
2  102A 2012-01-23  99.45       66   66 
    customer_id  date invoice_amt no_days_since_last_purchase days_since 
7  104B 2011-10-12  765.21       NaN  4302 
5  104B 2011-11-27  346.87       46   46 
3  104B 2011-12-18  767.63       21   21 

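Oh, and I get SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

Any ideas on how I should avoid this warning would also be appreciated.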

+0

Possible duplicate of "Setting values on a copy of a slice from a DataFrame" (http://stackoverflow.com/questions/31468176/) – firelynx

Answers

0
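# collect each customer's group after adding the days_since column
df_container = []
for customer_id, group in grouped_data:
    group['days_since'] = (group['date'] - group['date'].shift().fillna(pd.Timestamp(2000, 1, 1))).dt.days
    df_container.append(group)

# stack the per-customer groups back into a single dataframe
data_df = pd.concat(df_container)

Maybe this is what you want?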

customer_id  date invoice_amt no_days_since_last_purchase days_since 
8  101A 2011-10-01  275.76       NaN  4291 
4  101A 2011-12-09  124.76       69   69 
1  101A 2012-02-01  234.45       54   54 
0  101A 2012-03-21  654.76       49   49 
9  102A 2011-09-21  532.21       NaN  4281 
6  102A 2011-11-18  652.65       58   58 
2  102A 2012-01-23  99.45       66   66 
7  104B 2011-10-12  765.21       NaN  4302 
5  104B 2011-11-27  346.87       46   46 
3  104B 2011-12-18  767.63       21   21 
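On the SettingWithCopyWarning part of the question: a minimal sketch of one way to sidestep it, assuming the warning comes from assigning a column to a group that pandas treats as a slice of data_df. Taking an explicit copy of each group before assigning is a common workaround. The small frame below is made up for illustration and only mirrors the question's column names.

```python
import pandas as pd

# Made-up sample with the same columns as the question's data_df.
data_df = pd.DataFrame({
    'customer_id': ['101A', '101A', '102A', '102A'],
    'date': pd.to_datetime(['2012-03-21', '2012-02-01', '2012-01-23', '2011-11-18']),
    'invoice_amt': [654.76, 234.45, 99.45, 652.65],
}).sort_values(by='date')

pieces = []
for customer_id, group in data_df.groupby('customer_id'):
    group = group.copy()  # independent copy, so the assignment below is unambiguous
    previous = group['date'].shift().fillna(pd.Timestamp(2000, 1, 1))
    group['days_since'] = (group['date'] - previous).dt.days
    pieces.append(group)

# Reassemble, then restore the original row order using the preserved index.
result = pd.concat(pieces).sort_index()
print(result)
```

Because concat keeps the original index labels, sort_index() puts the rows back in the order of the initial dataframe.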
1

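Use transform: it produces a Series whose index is aligned to the original df, so you can assign it directly as a new column. Additionally, you can't use astype to cast datetime64[ns] to timedelta[D], so you need an extra step that calls to_timedelta.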

In [193]: 
data_df['days_since'] = data_df.groupby(['customer_id'])['date'].transform(lambda x: x - x.shift().fillna(pd.Timestamp(2000, 1, 1)))
data_df['days_since'] = pd.to_timedelta(data_df['days_since']) 
data_df 

Out[193]: 
    customer_id  date invoice_amt no_days_since_last_purchase days_since 
9  102A 2011-09-21  532.21       NaN 4281 days 
8  101A 2011-10-01  275.76       NaN 4291 days 
7  104B 2011-10-12  765.21       NaN 4302 days 
6  102A 2011-11-18  652.65       58  58 days 
5  104B 2011-11-27  346.87       46  46 days 
4  101A 2011-12-09  124.76       69  69 days 
3  104B 2011-12-18  767.63       21  21 days 
2  102A 2012-01-23  99.45       66  66 days 
1  101A 2012-02-01  234.45       54  54 days 
0  101A 2012-03-21  654.76       49  49 days 

Edit

Actually, you can call to_timedelta directly on the returned Series, like so:

data_df['days_since'] = pd.to_timedelta(data_df.groupby(['customer_id'])['date'].transform(lambda x: x - x.shift().fillna(pd.Timestamp(2000, 1, 1))))
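For comparison, here is a self-contained sketch that aims at the target dataframe from the question (plain day counts, with NaN for each customer's first purchase). It swaps in groupby(...).diff() for the x - x.shift() lambda and skips the 2000-01-01 fill value, so this is a variation on the approach above rather than the same code. It assumes a reasonably recent pandas version.

```python
import pandas as pd

# The question's data, with dates parsed up front.
data_df = pd.DataFrame({
    'customer_id': ['101A', '101A', '102A', '104B', '101A',
                    '104B', '102A', '104B', '101A', '102A'],
    'date': pd.to_datetime(['20120321', '20120201', '20120123', '20111218',
                            '20111209', '20111127', '20111118', '20111012',
                            '20111001', '20110921'], format='%Y%m%d'),
    'invoice_amt': [654.76, 234.45, 99.45, 767.63, 124.76,
                    346.87, 652.65, 765.21, 275.76, 532.21],
})

# Sort chronologically so diff() compares each purchase with the previous one.
ordered = data_df.sort_values('date')

# diff() within each customer gives a timedelta (NaT for the first purchase);
# .dt.days turns that into a number of days, with NaT becoming NaN.
# Assignment aligns on the index, so data_df keeps its original row order.
data_df['days_since'] = ordered.groupby('customer_id')['date'].diff().dt.days
print(data_df)
```

The result column is float rather than integer because it has to hold NaN for the rows with no earlier purchase.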