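PANDAS: Merging calculated data from a dataframe into the main dataframe
First-time poster here, so apologies if I haven't asked this question quite correctly. I've spent many years manipulating data in Excel and PowerPivot, but my current project needs more heavy lifting. I've been looking at pandas and think it is up to the job, but I'm stuck.
I'm trying to calculate the number of days since the previous purchase for each customer.
My initial dataframe looks like this: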
customer_id date invoice_amt
0 101A 21/03/2012 654.76
1 101A 1/02/2012 234.45
2 102A 23/01/2012 99.45
3 104B 18/12/2011 767.63
4 101A 9/12/2011 124.76
5 104B 27/11/2011 346.87
6 102A 18/11/2011 652.65
7 104B 12/10/2011 765.21
8 101A 1/10/2011 275.76
9 102A 21/09/2011 532.21
My target dataframe looks like this:
customer_id date invoice_amt days_since
0 101A 21/03/2012 654.76 49
1 101A 1/02/2012 234.45 54
2 102A 23/01/2012 99.45 66
3 104B 18/12/2011 767.63 21
4 101A 9/12/2011 124.76 69
5 104B 27/11/2011 346.87 46
6 102A 18/11/2011 652.65 58
7 104B 12/10/2011 765.21 NaN
8 101A 1/10/2011 275.76 NaN
9 102A 21/09/2011 532.21 NaN
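I've got to the point of being able to calculate the days_since values within each grouped dataframe, but I don't know how to get those values back into the main dataframe (data_df).
Any help would be much appreciated... thanks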
import pandas as pd
#import numpy as np
#dataframe data note: no_days_since_last_purchase hard coded for testing purposes
my_data = {'customer_id' : ['101A', '101A', '102A', '104B', '101A', '104B', '102A', '104B', '101A', '102A' ],
'date' : ['20120321','20120201','20120123','20111218','20111209','20111127','20111118','20111012','20111001','20110921'],
'invoice_amt' : [654.76, 234.45, 99.45, 767.63, 124.76, 346.87, 652.65, 765.21, 275.76, 532.21 ],
'no_days_since_last_purchase' : ['49', '54', '66', '21', '69', '46', '58', 'NaN', 'NaN', 'NaN']}
data_df = pd.DataFrame(my_data).sort_values(by='date', ascending=True)
#convert date str to date type
data_df['date'] = pd.to_datetime(data_df['date'].astype(str),format='%Y%m%d')
#group dataframe by customer_id
grouped_data = data_df.groupby(['customer_id'])
#for each row in each grouped dataframe calculate the difference in days between current and previous
#if there is no previous then use 2000-01-01 then convert to integer
for customer_id, group in grouped_data:
    group['days_since'] = (group['date'] - group['date'].shift().fillna(pd.Timestamp(2000, 1, 1))).dt.days
    print(group)
OUTPUT:
customer_id date invoice_amt no_days_since_last_purchase days_since
8 101A 2011-10-01 275.76 NaN 4291
4 101A 2011-12-09 124.76 69 69
1 101A 2012-02-01 234.45 54 54
0 101A 2012-03-21 654.76 49 49
customer_id date invoice_amt no_days_since_last_purchase days_since
9 102A 2011-09-21 532.21 NaN 4281
6 102A 2011-11-18 652.65 58 58
2 102A 2012-01-23 99.45 66 66
customer_id date invoice_amt no_days_since_last_purchase days_since
7 104B 2011-10-12 765.21 NaN 4302
5 104B 2011-11-27 346.87 46 46
3 104B 2011-12-18 767.63 21 21
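For comparison, here is a minimal sketch of one way the per-customer gaps could be written straight into the main dataframe instead of into each group. It assumes that a grouped `diff()` on the date column is acceptable in place of the loop above, and it leaves each customer's first purchase as NaN (as in the target table) rather than filling it with 2000-01-01. It is only one possible approach, not necessarily the best one for a larger dataset.

```python
import pandas as pd

# Same sample data as in the question (dates as YYYYMMDD strings).
data_df = pd.DataFrame({
    'customer_id': ['101A', '101A', '102A', '104B', '101A',
                    '104B', '102A', '104B', '101A', '102A'],
    'date': ['20120321', '20120201', '20120123', '20111218', '20111209',
             '20111127', '20111118', '20111012', '20111001', '20110921'],
    'invoice_amt': [654.76, 234.45, 99.45, 767.63, 124.76,
                    346.87, 652.65, 765.21, 275.76, 532.21],
})
data_df['date'] = pd.to_datetime(data_df['date'], format='%Y%m%d')

# Sort so that each customer's purchases are in chronological order.
data_df = data_df.sort_values(['customer_id', 'date'])

# diff() within each group gives the gap to that customer's previous purchase.
# The result keeps the original row index, so it aligns with data_df on assignment.
data_df['days_since'] = data_df.groupby('customer_id')['date'].diff().dt.days

# Restore the original row order (index 0..9).
data_df = data_df.sort_index()
print(data_df)
```

Because the assignment targets `data_df` itself, no separate merge step is needed; the index alignment does the work of matching each value to its row.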
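Oh, and I get a SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
Any ideas on how I should avoid this warning would also be much appreciated.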
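As a rough illustration of the warning: in pandas versions that emit it, it typically appears when a column is assigned on an object that pandas considers a possible copy of another dataframe, such as the per-customer `group` in the loop. The sketch below shows two ways that might sidestep it, using a reduced four-row subset of the data for brevity. The names `prev_date`, `pieces`, `looped_df` and `days_since_loop` are made up for this example.

```python
import pandas as pd

# Reduced subset of the question's rows, for brevity.
data_df = pd.DataFrame({
    'customer_id': ['101A', '102A', '101A', '102A'],
    'date': pd.to_datetime(['2011-10-01', '2011-09-21',
                            '2011-12-09', '2011-11-18']),
})
data_df = data_df.sort_values('date')

# Option 1: assign on the main frame, so no per-group slice is modified.
prev_date = data_df.groupby('customer_id')['date'].shift()
data_df['days_since'] = (data_df['date'] - prev_date).dt.days

# Option 2: keep the loop, but work on explicit copies and
# stitch the pieces back together afterwards.
pieces = []
for _, group in data_df.groupby('customer_id'):
    group = group.copy()  # explicit copy, so the assignment is unambiguous
    group['days_since_loop'] = (group['date'] - group['date'].shift()).dt.days
    pieces.append(group)
looped_df = pd.concat(pieces).sort_index()

print(data_df.sort_index())
print(looped_df)
```

Option 1 avoids the loop entirely; Option 2 stays closer to the original code at the cost of rebuilding the dataframe with `pd.concat`.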
Possible duplicate of [Setting values on a copy of a slice from a dataframe](http://stackoverflow.com/questions/31468176/setting-values-on-a-copy-of-a-slice-from-a-dataframe) – firelynx