2016-11-16 128 views

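Pandas: merge on a date that falls between two columns

I have two dataframes: one of customer calls, and another identifying the periods during which a service was active. Each customer can have multiple services, but they never overlap.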

import pandas as pd

df_calls = pd.DataFrame([['A','2016-02-03',1],['A','2016-05-11',2],['A','2016-10-01',3],['A','2016-11-02',4],
                         ['B','2016-01-10',5],['B','2016-04-25',6]], columns = ['cust_id','call_date','call_id'])

print(df_calls)

  cust_id   call_date  call_id
0       A  2016-02-03        1
1       A  2016-05-11        2
2       A  2016-10-01        3
3       A  2016-11-02        4
4       B  2016-01-10        5
5       B  2016-04-25        6

df_active = pd.DataFrame([['A','2016-01-10','2016-03-15',1],['A','2016-09-10','2016-11-15',2], 
          ['B','2016-01-02','2016-03-17',3]], columns = ['cust_id','service_start','service_end','service_id']) 


print(df_active)

  cust_id  service_start  service_end  service_id
0       A     2016-01-10   2016-03-15           1
1       A     2016-09-10   2016-11-15           2
2       B     2016-01-02   2016-03-17           3

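I need to find the service_id each call belongs to, as identified by the service_start and service_end dates. If a call does not fall between any pair of dates, it should still remain in the dataset.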

Here is what I have tried so far:

df_test_output = pd.merge(df_calls,df_active, how = 'left',on = ['cust_id']) 
df_test_output = df_test_output[(df_test_output['call_date']>= df_test_output['service_start']) 
         & (df_test_output['call_date']<= df_test_output['service_end'])].drop(['service_start','service_end'],axis = 1) 

print(df_test_output)

  cust_id   call_date  call_id  service_id
0       A  2016-02-03        1           1
5       A  2016-10-01        3           2
7       A  2016-11-02        4           2
8       B  2016-01-10        5           3

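This drops every call that does not fall between service dates. Any thoughts on how to merge in the service_id where the condition is met, while keeping the remaining records?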

The result should look like this:

#do black magic 

print(df_calls)

  cust_id   call_date  call_id  service_id
0       A  2016-02-03        1         1.0
1       A  2016-05-11        2         NaN
2       A  2016-10-01        3         2.0
3       A  2016-11-02        4         2.0
4       B  2016-01-10        5         3.0
5       B  2016-04-25        6         NaN

You can join df_calls2 with df_calls on call_id.

Answer

You can use merge with a left join:

print (pd.merge(df_calls, df_calls2, how='left')) 
  cust_id   call_date  call_id  service_id
0       A  2016-02-03        1         1.0
1       A  2016-05-11        2         NaN
2       A  2016-10-01        3         2.0
3       A  2016-11-02        4         2.0
4       B  2016-01-10        5         3.0
5       B  2016-04-25        6         NaN
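
For reference, here is a minimal end-to-end sketch of the approach above. It assumes that df_calls2 is the filtered frame from the question (named df_test_output there), which is what the comments suggest but the post does not show directly. Because no `on` argument is given, merge joins on the columns the two frames share (cust_id, call_date and call_id).

```python
import pandas as pd

df_calls = pd.DataFrame(
    [['A', '2016-02-03', 1], ['A', '2016-05-11', 2], ['A', '2016-10-01', 3],
     ['A', '2016-11-02', 4], ['B', '2016-01-10', 5], ['B', '2016-04-25', 6]],
    columns=['cust_id', 'call_date', 'call_id'])

df_active = pd.DataFrame(
    [['A', '2016-01-10', '2016-03-15', 1], ['A', '2016-09-10', '2016-11-15', 2],
     ['B', '2016-01-02', '2016-03-17', 3]],
    columns=['cust_id', 'service_start', 'service_end', 'service_id'])

# Step 1 (from the question): pair every call with every service of the
# same customer, then keep only the pairs where the call date is in range.
merged = pd.merge(df_calls, df_active, how='left', on=['cust_id'])
in_range = ((merged['call_date'] >= merged['service_start'])
            & (merged['call_date'] <= merged['service_end']))
# Assumption: this filtered frame is what the thread calls df_calls2.
df_calls2 = merged[in_range].drop(['service_start', 'service_end'], axis=1)

# Step 2 (from the answer): left-join the matches back onto the full call
# list, so calls without a matching service keep a NaN service_id.
result = pd.merge(df_calls, df_calls2, how='left')
print(result)
```

The date columns are plain strings here, as in the question. The comparison still works because ISO-formatted dates sort correctly as text; converting them with pd.to_datetime would be the safer choice for other formats.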

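df_calls2 isn't a real table. It is the result of merging df_calls and df_service and then dropping the unwanted output. I created it to show that the approach I tried doesn't work. – flyingmeatball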

Hmm, so you think it works, but you are looking for a better solution? – jezrael

Ah gotcha - I see what you're saying, that works, thanks! I had been exploring using graphs: https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.sparse.csgraph.connected_components.html – flyingmeatball
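
As a possible alternative to filtering and then merging back, the sketch below does it in one pass. This is not from the thread: it blanks out service_id on out-of-range pairs instead of dropping them, then collapses back to one row per call. It relies on the stated assumption that a customer's services never overlap, so each call has at most one non-null service_id. The names `in_range` and `result` are illustrative.

```python
import pandas as pd

df_calls = pd.DataFrame(
    [['A', '2016-02-03', 1], ['A', '2016-05-11', 2], ['A', '2016-10-01', 3],
     ['A', '2016-11-02', 4], ['B', '2016-01-10', 5], ['B', '2016-04-25', 6]],
    columns=['cust_id', 'call_date', 'call_id'])

df_active = pd.DataFrame(
    [['A', '2016-01-10', '2016-03-15', 1], ['A', '2016-09-10', '2016-11-15', 2],
     ['B', '2016-01-02', '2016-03-17', 3]],
    columns=['cust_id', 'service_start', 'service_end', 'service_id'])

# Parse the dates so comparisons do not depend on the string format.
df_calls['call_date'] = pd.to_datetime(df_calls['call_date'])
for col in ['service_start', 'service_end']:
    df_active[col] = pd.to_datetime(df_active[col])

# Pair every call with every service of the same customer.
merged = df_calls.merge(df_active, on='cust_id', how='left')

# Keep service_id only where the call falls inside the service window;
# everywhere else it becomes NaN instead of the row being dropped.
in_range = ((merged['call_date'] >= merged['service_start'])
            & (merged['call_date'] <= merged['service_end']))
merged['service_id'] = merged['service_id'].where(in_range)

# Collapse to one row per call. Services do not overlap, so each group
# holds at most one non-null service_id, and max() picks it (or NaN).
result = (merged
          .groupby(['cust_id', 'call_date', 'call_id'], as_index=False)
          ['service_id'].max())
print(result)
```

Note that groupby sorts by the grouping keys, so the row order follows cust_id, call_date and call_id rather than the original order of df_calls; for this sample data the two coincide.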