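I need some help transforming my data so that I can read through transaction data.
Create groups/classes based on conditions within columns
Business case
I am trying to group related transactions together to create groups or classes of events. This data set represents workers going out on various leave-of-absence events. I want to create one class of leave from any transactions that fall within 365 days of the leave event class. To chart trends, I want to number these classes so that I get a sequence/pattern.
My code lets me see when the very first event occurred, and it can identify when a new class starts, but it does not bucket each transaction into a class.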
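Requirements:
- Tag every row according to the leave class it falls into.
- Number each unique leave event. In this example, index 0 would be Unique Leave Event 2, index 1 would be Unique Leave Event 2, index 2 would be Unique Leave Event 2, index 3 would be Unique Leave Event 1, and so on.
I have added the output I want as a column labeled "Desired Output". Note that there can be many more rows/events per person, and there can be many more people.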
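The requirements above can be sketched roughly as follows. This is only a minimal sketch under one reading of the rule: a new leave event is assumed to start on an employee's first row, or whenever more than 365 days have passed since that employee's previous row. The sample frame is a hypothetical stand-in with full ISO dates, and the column name `Leave Event` is made up for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring the question's layout (full ISO dates assumed).
df = pd.DataFrame({
    "Employee ID": ["100", "100", "100", "100", "200", "200", "200", "300"],
    "Effective Date": ["2016-01-01", "2015-06-05", "2014-07-01", "2013-01-01",
                       "2016-01-01", "2015-01-01", "2013-01-01", "2014-01-01"],
})
df["Effective Date"] = pd.to_datetime(df["Effective Date"])

# Work oldest-first within each employee; the original index is preserved.
ordered = df.sort_values(["Employee ID", "Effective Date"])

# Time since the same employee's previous row (NaT on each employee's first row).
gap = ordered.groupby("Employee ID")["Effective Date"].diff()

# Assumed rule: a new event starts on the first row or after a gap of more than 365 days.
starts_new = gap.isna() | (gap > pd.Timedelta(days=365))

# A running count of event starts per employee gives the event number.
event_no = starts_new.astype(int).groupby(ordered["Employee ID"]).cumsum()

# Assignment aligns on the index, so each label lands on its original row.
df["Leave Event"] = "Unique Leave Event " + event_no.astype(str)
print(df)
```

Because the gap is measured between consecutive rows, a chain of transactions that are each within 365 days of the previous one stays in a single class even when the chain as a whole spans more than a year. If the window should instead be measured from the first transaction of each class, the cumulative sum would need to be replaced by an explicit loop per employee.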
Some data
import pandas as pd
data = {'Employee ID': ["100", "100", "100","100","200","200","200","300"],
'Effective Date': ["2016-01-01","2015-06-05","2014-07-01","2013-01-01","2016-01-01","2015-01-01","2013-01-01","2014-01"],
'Desired Output': ["Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 1"]}
df = pd.DataFrame(data, columns=['Employee ID','Effective Date','Desired Output'])
Some code I have tried
# Parse the dates so that differences can be computed.
df['Effective Date'] = df['Effective Date'].astype('datetime64[ns]')
# Look ahead one row (rows are ordered newest-first within each employee).
df['EmplidShift'] = df['Employee ID'].shift(-1)
df['Effdt-Shift'] = df['Effective Date'].shift(-1)
# Whole days from this row to the next (older) row; negative when the next row is earlier.
df['Effdt Diff'] = (df['Effdt-Shift'] - df['Effective Date']).dt.days
# Flag the last (oldest) row of each employee as that employee's first event.
df['Cumul. Count'] = df.groupby('Employee ID').cumcount()
df['Groupby'] = df.groupby('Employee ID')['Cumul. Count'].transform('max')
df['First Row Appears?'] = ""
df.loc[df['Cumul. Count'] == df['Groupby'], 'First Row Appears?'] = "First Row"
# Does the next row belong to the same employee?
df['Prior Row in Same Emplid Class'] = "No"
df.loc[df['Employee ID'] == df['EmplidShift'], 'Prior Row in Same Emplid Class'] = "Yes"
# Is the same employee's prior event more than 365 days earlier?
df['Effdt > 1 Yr?'] = ""
df.loc[(df['Prior Row in Same Emplid Class'] == "Yes") & (df['Effdt Diff'] < -365), 'Effdt > 1 Yr?'] = "Yes"
# A row starts a new leave event if it is the employee's first row or follows a gap of more than a year.
df['Unique Leave Event'] = ""
df.loc[(df['Effdt > 1 Yr?'] == "Yes") | (df['First Row Appears?'] == "First Row"), 'Unique Leave Event'] = "Unique Leave Event"
df
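This is an elegant solution. The only danger might lie in the `merge` if OP uses really huge dataframes, but judging from the content of the data that is unlikely. – Khris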