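Pandas: merge on dates between columns

I have two data frames: one of customer calls and another that identifies the active service periods. Each customer can have multiple services, but they never overlap.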
import pandas as pd

df_calls = pd.DataFrame([['A','2016-02-03',1],['A','2016-05-11',2],['A','2016-10-01',3],['A','2016-11-02',4],
                         ['B','2016-01-10',5],['B','2016-04-25',6]], columns = ['cust_id','call_date','call_id'])
print(df_calls)
   cust_id   call_date  call_id
0        A  2016-02-03        1
1        A  2016-05-11        2
2        A  2016-10-01        3
3        A  2016-11-02        4
4        B  2016-01-10        5
5        B  2016-04-25        6
and
df_active = pd.DataFrame([['A','2016-01-10','2016-03-15',1],['A','2016-09-10','2016-11-15',2],
['B','2016-01-02','2016-03-17',3]], columns = ['cust_id','service_start','service_end','service_id'])
print(df_active)
   cust_id  service_start  service_end  service_id
0        A     2016-01-10   2016-03-15           1
1        A     2016-09-10   2016-11-15           2
2        B     2016-01-02   2016-03-17           3
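I need to find the service_id each call belongs to, as identified by the service_start and service_end dates. If a call does not fall between any pair of dates, it should still remain in the dataset.
Here is what I have tried so far: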
# Pair every call with every service of the same customer
df_test_output = pd.merge(df_calls, df_active, how = 'left', on = ['cust_id'])
# Keep only the pairs where the call date lies inside the service period
df_test_output = df_test_output[(df_test_output['call_date'] >= df_test_output['service_start'])
                                & (df_test_output['call_date'] <= df_test_output['service_end'])].drop(['service_start','service_end'], axis = 1)
print(df_test_output)
   cust_id   call_date  call_id  service_id
0        A  2016-02-03        1           1
5        A  2016-10-01        3           2
7        A  2016-11-02        4           2
8        B  2016-01-10        5           3
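This drops every call that does not fall within a service period. Any ideas on how to merge in the service_id where the condition is met, but keep the remaining records?
The result should look like this: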
# do black magic
print(df_calls)
   cust_id   call_date  call_id  service_id
0        A  2016-02-03        1         1.0
1        A  2016-05-11        2         NaN
2        A  2016-10-01        3         2.0
3        A  2016-11-02        4         2.0
4        B  2016-01-10        5         3.0
5        B  2016-04-25        6         NaN
You can join 'df_calls2' with 'df_calls' on 'call_id'.
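
The join suggested in the comment could look roughly like the sketch below. It assumes that 'df_calls2' refers to the filtered merge result (the comment does not define it). The names `pairs`, `in_service`, `matched` and `result` are made up for illustration. Converting the date strings with `pd.to_datetime` is also an addition here: the question compares ISO-formatted strings directly, which happens to work but breaks with other date formats.

```python
import pandas as pd

df_calls = pd.DataFrame(
    [['A', '2016-02-03', 1], ['A', '2016-05-11', 2], ['A', '2016-10-01', 3],
     ['A', '2016-11-02', 4], ['B', '2016-01-10', 5], ['B', '2016-04-25', 6]],
    columns=['cust_id', 'call_date', 'call_id'])
df_active = pd.DataFrame(
    [['A', '2016-01-10', '2016-03-15', 1], ['A', '2016-09-10', '2016-11-15', 2],
     ['B', '2016-01-02', '2016-03-17', 3]],
    columns=['cust_id', 'service_start', 'service_end', 'service_id'])

# Assumption: parse the strings into real dates before comparing them
df_calls['call_date'] = pd.to_datetime(df_calls['call_date'])
df_active['service_start'] = pd.to_datetime(df_active['service_start'])
df_active['service_end'] = pd.to_datetime(df_active['service_end'])

# Step 1: pair every call with every service of the same customer
pairs = df_calls.merge(df_active, on='cust_id', how='left')

# Step 2: keep only the pairs where the call falls inside the service period
in_service = ((pairs['call_date'] >= pairs['service_start'])
              & (pairs['call_date'] <= pairs['service_end']))
matched = pairs.loc[in_service, ['call_id', 'service_id']]

# Step 3: left-join the matches back onto the full call list, so calls
# without a matching service are kept and get NaN as service_id
result = df_calls.merge(matched, on='call_id', how='left')
print(result)
```

With the sample data, `result` keeps all six calls, and calls 2 and 6 get NaN for service_id, as in the desired output. Because services of one customer do not overlap, each call matches at most one service, so step 3 cannot duplicate rows. The intermediate `pairs` frame grows with calls times services per customer, which may matter for large inputs.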