2017-10-05 71 views
0

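Pandas time series window labeling

I currently have a process for windowing time series data, but I am wondering whether there is a vectorized, in-place approach, for performance and resource reasons.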

I have two lists holding the start and end dates of 30-day windows:

start_dts = [2014-01-01, ...]
end_dts = [2014-01-30, ...]

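I have a DataFrame with a field called 'transaction_dt'.

What I am trying to accomplish is a way to add two new columns ('start_dt' and 'end_dt') to each row, filled in with the window whose 'start_dt' and 'end_dt' values bracket that row's transaction_dt. Ideally this would be vectorized, if possible.

Edit:

As requested, here is some sample data in my format: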

'customer_id','transaction_dt','product','price','units' 
1,2004-01-02,thing1,25,47 
1,2004-01-17,thing2,150,8 
2,2004-01-29,thing2,150,25 
+0

Please add your sample data – Wen

+0

@Wen I have added sample data in my format, as requested. Thanks! – Pylander

+0

Check my answer – Wen

Answers

0

IIUC

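By using IntervalIndex

# Index the windows by the closed intervals [Start, End]
df2.index=pd.IntervalIndex.from_arrays(df2['Start'],df2['End'],closed='both')
# Select the columns explicitly so the values line up with the target columns
# regardless of the column order pandas gives df2
df[['End','Start']]=df2.loc[df['transaction_dt'],['End','Start']].values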


df 
Out[457]: 
    transaction_dt  End  Start 
0  2017-01-02 2017-01-31 2017-01-01 
1  2017-03-02 2017-03-31 2017-03-01 
2  2017-04-02 2017-04-30 2017-04-01 
3  2017-05-02 2017-05-31 2017-05-01 

Input data:

df=pd.DataFrame({'transaction_dt':['2017-01-02','2017-03-02','2017-04-02','2017-05-02']}) 
df['transaction_dt']=pd.to_datetime(df['transaction_dt']) 
list1=['2017-01-01','2017-02-01','2017-03-01','2017-04-01','2017-05-01'] 
list2=['2017-01-31','2017-02-28','2017-03-31','2017-04-30','2017-05-31'] 
df2=pd.DataFrame({'Start':list1,'End':list2}) 
df2.Start=pd.to_datetime(df2.Start) 
df2.End=pd.to_datetime(df2.End) 
+0

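I get a KeyError related to a bad value, and earlier in the traceback there is a nested tuple slice related to "cannot index with multidimensional key" – Pylander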
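One possible cause of a KeyError like the one above (this is a guess, not a confirmed diagnosis) is a transaction_dt that falls outside every window, since .loc raises for values it cannot place. A minimal sketch, assuming the same Start/End layout as in the answer and non-overlapping windows: IntervalIndex.get_indexer returns -1 for unmatched values instead of raising, so those rows can be left as NaT. The 2018-01-01 row is a made-up date added only to show the unmatched case.

```python
import pandas as pd

df = pd.DataFrame({"transaction_dt": pd.to_datetime(
    ["2017-01-02", "2017-03-02", "2018-01-01"])})  # last date has no window (hypothetical)
df2 = pd.DataFrame({
    "Start": pd.to_datetime(["2017-01-01", "2017-02-01", "2017-03-01"]),
    "End": pd.to_datetime(["2017-01-31", "2017-02-28", "2017-03-31"]),
})

# Closed intervals [Start, End]; get_indexer requires them not to overlap
windows = pd.IntervalIndex.from_arrays(df2["Start"], df2["End"], closed="both")

# Position of the window containing each date, or -1 when none does
pos = windows.get_indexer(df["transaction_dt"])
found = pos >= 0

# Take the boundaries by position, then blank out the unmatched rows (NaT)
df["Start"] = pd.Series(df2["Start"].to_numpy()[pos], index=df.index).where(found)
df["End"] = pd.Series(df2["End"].to_numpy()[pos], index=df.index).where(found)
```

Rows without a window end up with NaT in both columns, so they can be inspected or dropped afterwards with df.dropna(subset=["Start"]).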

0

If you want the start and end of the month, we can make use of this: Extracting the first day of month of a datetime type column in pandas

import io 
import pandas as pd 
import datetime 

string = """customer_id,transaction_dt,product,price,units 
1,2004-01-02,thing1,25,47 
1,2004-01-17,thing2,150,8 
2,2004-01-29,thing2,150,25""" 

df = pd.read_csv(io.StringIO(string)) 

df["transaction_dt"] = pd.to_datetime(df["transaction_dt"]) 

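# Roll forward to the month end first (MonthEnd(0) leaves a month-end date unchanged),
# so dates on the first or last day of a month stay inside their own month
month_end = df['transaction_dt'].dt.floor('d') + pd.offsets.MonthEnd(0)
df["start"] = month_end - pd.offsets.MonthBegin(1)
df["end"] = month_end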

df 

Returns

   customer_id transaction_dt product  price  units      start        end
0            1     2004-01-02  thing1     25     47 2004-01-01 2004-01-31
1            1     2004-01-17  thing2    150      8 2004-01-01 2004-01-31
2            2     2004-01-29  thing2    150     25 2004-01-01 2004-01-31

New approach

import io 
import pandas as pd 
import datetime 

string = """customer_id,transaction_dt,product,price,units 
1,2004-01-02,thing1,25,47 
1,2004-01-17,thing2,150,8 
2,2004-06-29,thing2,150,25""" 

df = pd.read_csv(io.StringIO(string)) 

df["transaction_dt"] = pd.to_datetime(df["transaction_dt"]) 

# Build the window boundaries, 30 days apart, that cover the data.
# This assumes the dates are sorted;
# if not, replace [0] with the minimum date and [-1] with the maximum date.
timestamps = [df.iloc[0]["transaction_dt"].floor('d') - pd.offsets.MonthBegin(1)]
while df.iloc[-1]["transaction_dt"].floor('d') > timestamps[-1]:
    timestamps.append(timestamps[-1]+datetime.timedelta(days=30))

# We store all ranges here
ranges = list(zip(timestamps,timestamps[1:]))

# Loop through all values and fill in the start and end columns
for ind,value in enumerate(df["transaction_dt"]):
    for i,(start,end) in enumerate(ranges):
        if start <= value <= end:
            df.loc[ind, "start"] = start
            df.loc[ind, "end"] = end
            # Once a match is found, drop the earlier ranges that can no
            # longer match. Remove this line if the dates are not sorted;
            # it should speed things up for large datasets.
            del ranges[:i]
            # Stop scanning: the window is found and the list was just modified
            break
+0

This is a good approach. Unfortunately, I need them to be exactly 30-day windows, not monthly. – Pylander

+0

@Pylander I rewrote the code, but I can't tell you how efficient it is :D –
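If the windows are contiguous, exactly 30 days long, and all counted from one known first day, the loop may not be needed at all. The following is only a sketch under those assumptions, not the code from the answer above: origin is an assumed anchor date (the first day of the first window), and each window covers 30 calendar days inclusive, matching the 2014-01-01 to 2014-01-30 example in the question. The window number is obtained by integer-dividing the elapsed time by 30 days, which is a vectorized operation on the whole column.

```python
import pandas as pd

df = pd.DataFrame({"transaction_dt": pd.to_datetime(
    ["2004-01-02", "2004-01-17", "2004-06-29"])})

origin = pd.Timestamp("2004-01-01")  # assumed first day of the first window
window = pd.Timedelta(days=30)

# Number of whole 30-day windows elapsed since origin, for every row at once
n = (df["transaction_dt"] - origin) // window

df["start_dt"] = origin + n * window
# Inclusive end: the last day of the same 30-day window
df["end_dt"] = df["start_dt"] + window - pd.Timedelta(days=1)
```

This does not require the rows to be sorted, and dates before origin get negative window numbers that still map to the correct 30-day span. If the real windows are irregular or have gaps, this shortcut does not apply and a lookup against the start/end lists is needed instead.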