Pandas time series window labels

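I currently have a process for windowing time series data, but I am wondering whether there is a vectorized, in-place approach, for performance/resource reasons.

I have two lists holding the start and end dates of 30-day windows:

start_dts = [2014-01-01, ...]
end_dts = [2014-01-30, ...]

I have a dataframe with a field called 'transaction_dt'.

What I am trying to accomplish is a way to add two new columns ('start_dt' and 'end_dt') to each row when transaction_dt falls between a pair of 'start_dt' and 'end_dt' values. Ideally this would be vectorized, if possible.

Edit:

As requested, here is some sample data in my format: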

'customer_id','transaction_dt','product','price','units' 
1,2004-01-02,thing1,25,47 
1,2004-01-17,thing2,150,8 
2,2004-01-29,thing2,150,25

Source

2017-10-05 Pylander

Please add your sample data. – Wen

@Wen I have added sample data in my format, as requested. Thanks! – Pylander

Check my answer. – Wen

IIUC

By using IntervalIndex

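df2.index=pd.IntervalIndex.from_arrays(df2['Start'],df2['End'],closed='both') 
# Select the columns explicitly so the result does not depend on df2's column order 
df[['End','Start']]=df2.loc[df['transaction_dt'],['End','Start']].values 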


df 
Out[457]: 
    transaction_dt  End  Start 
0  2017-01-02 2017-01-31 2017-01-01 
1  2017-03-02 2017-03-31 2017-03-01 
2  2017-04-02 2017-04-30 2017-04-01 
3  2017-05-02 2017-05-31 2017-05-01

Input data:

df=pd.DataFrame({'transaction_dt':['2017-01-02','2017-03-02','2017-04-02','2017-05-02']}) 
df['transaction_dt']=pd.to_datetime(df['transaction_dt']) 
list1=['2017-01-01','2017-02-01','2017-03-01','2017-04-01','2017-05-01'] 
list2=['2017-01-31','2017-02-28','2017-03-31','2017-04-30','2017-05-31'] 
df2=pd.DataFrame({'Start':list1,'End':list2}) 
df2.Start=pd.to_datetime(df2.Start) 
df2.End=pd.to_datetime(df2.End)
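
As far as I know, the `.loc` lookup above raises a KeyError as soon as one transaction_dt falls outside every interval. A minimal sketch of a more forgiving variant, assuming non-overlapping windows, uses `IntervalIndex.get_indexer`, which returns -1 for unmatched values instead of raising. The sample data and the names `windows`, `pos` and `matched` are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data: the second date falls outside every window
df = pd.DataFrame(
    {"transaction_dt": pd.to_datetime(["2017-01-02", "2017-02-15", "2017-03-02"])}
)
windows = pd.DataFrame({
    "Start": pd.to_datetime(["2017-01-01", "2017-03-01"]),
    "End": pd.to_datetime(["2017-01-31", "2017-03-31"]),
})

# Assumes the windows do not overlap; get_indexer needs that
intervals = pd.IntervalIndex.from_arrays(
    windows["Start"], windows["End"], closed="both"
)

# Position of the containing interval for each row, or -1 if there is none
pos = intervals.get_indexer(df["transaction_dt"])
matched = pos >= 0

# Look up the window bounds by position, then blank out unmatched rows (NaT)
df["Start"] = pd.Series(windows["Start"].to_numpy()[pos], index=df.index).where(matched)
df["End"] = pd.Series(windows["End"].to_numpy()[pos], index=df.index).where(matched)
```

Rows whose date matches no window end up with NaT in both columns, so they can be inspected or dropped afterwards.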

Source

2017-10-05 16:37:46 Wen

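I get a KeyError related to a bad value, and earlier in the traceback there is nested tuple slicing related to "Cannot index with multidimensional key". – Pylander

If you want the start and end, we can make use of this: Extracting the first day of month of a datetime type column in pandas: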

import io 
import pandas as pd 
import datetime 

string = """customer_id,transaction_dt,product,price,units 
1,2004-01-02,thing1,25,47 
1,2004-01-17,thing2,150,8 
2,2004-01-29,thing2,150,25""" 

df = pd.read_csv(io.StringIO(string)) 

df["transaction_dt"] = pd.to_datetime(df["transaction_dt"]) 

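# MonthEnd(0) rolls forward to the month end and leaves month-end dates unchanged; 
# stepping back one MonthBegin from there gives the first day of the same month 
df["start"] = df['transaction_dt'].dt.floor('d') + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1) 
df["end"] = df['transaction_dt'].dt.floor('d') + pd.offsets.MonthEnd(0) 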

df

   customer_id transaction_dt product  price  units      start        end
0            1     2004-01-02  thing1     25     47 2004-01-01 2004-01-31
1            1     2004-01-17  thing2    150      8 2004-01-01 2004-01-31
2            2     2004-01-29  thing2    150     25 2004-01-01 2004-01-31

New approach:

import io 
import pandas as pd 
import datetime 

string = """customer_id,transaction_dt,product,price,units 
1,2004-01-02,thing1,25,47 
1,2004-01-17,thing2,150,8 
2,2004-06-29,thing2,150,25""" 

df = pd.read_csv(io.StringIO(string)) 

df["transaction_dt"] = pd.to_datetime(df["transaction_dt"]) 

# Build the 30-day window boundaries that cover the data. 
# This assumes the dates are sorted; if they are not, 
# change [0] -> min_dt and [-1] -> max_dt 
timestamps = [df.iloc[0]["transaction_dt"].floor('d') - pd.offsets.MonthBegin(1)] 
while df.iloc[-1]["transaction_dt"].floor('d') > timestamps[-1]: 
    timestamps.append(timestamps[-1] + datetime.timedelta(days=30)) 

# Consecutive boundaries form the (start, end) ranges 
ranges = list(zip(timestamps, timestamps[1:])) 

# Loop through all values and fill in the start and end columns 
for ind, value in enumerate(df["transaction_dt"]): 
    for i, (start, end) in enumerate(ranges): 
        if start <= value <= end: 
            df.loc[ind, "start"] = start 
            df.loc[ind, "end"] = end 
            # Once a match is found, drop the ranges before it. 
            # This relies on sorted dates (remove it otherwise), 
            # but should speed things up for large datasets. 
            del ranges[:i] 
            # Stop scanning so the list is not mutated mid-iteration 
            break
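
If the windows are contiguous and exactly 30 days long, the nested loop above can probably be replaced by plain integer arithmetic. The following is only a sketch under that assumption; the `origin` date and the column names `start_dt` / `end_dt` are chosen for illustration and would need to match the real window lists:

```python
import pandas as pd

df = pd.DataFrame(
    {"transaction_dt": pd.to_datetime(["2004-01-02", "2004-01-17", "2004-06-29"])}
)

# Assumption: windows are back to back, 30 days each, starting at `origin`
origin = pd.Timestamp("2004-01-01")
window_days = 30

# Number of whole 30-day windows elapsed since the origin, per row
n = (df["transaction_dt"].dt.floor("D") - origin).dt.days // window_days

# First day of each row's window
df["start_dt"] = origin + pd.to_timedelta(n * window_days, unit="D")
# Inclusive last day of the same window
df["end_dt"] = df["start_dt"] + pd.Timedelta(days=window_days - 1)
```

This avoids Python-level loops entirely and does not require the rows to be sorted, but it does not apply when the windows have gaps or irregular lengths.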

Source

2017-10-05 17:03:23

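This is a nice approach. Unfortunately, I need them to be exactly 30-day windows, not monthly. – Pylander

@Pylander I rewrote the code, but I can't tell you how efficient it is :D –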
