2016-09-26 60 views
7

I need help transforming my data so that I can read through transaction data. Create groups/classes based on conditions within columns

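Business case

I want to group certain related transactions together to create groups or classes of events. This data set represents workers going out on various leave-of-absence events. I want to create a class of leave based on any transaction that falls within 365 days of the leave event class. To chart trends, I want to number the classes so that I get a sequence/pattern.

My code lets me see when the first event occurred and can identify when a new class starts, but it does not bucket each transaction into a class.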

Requirements:

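  • Tag every row according to the leave class it falls into.
  • Number each unique leave event. Using this example, index 0 would be Unique Leave Event 2, index 1 would be Unique Leave Event 2, index 2 would be Unique Leave Event 2, index 3 would be Unique Leave Event 1, and so on.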

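I have added the desired output in a column labeled "Desired Output". Note that each person can have many more rows/events, and there can be many more people.

Some data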

import pandas as pd 

data = {'Employee ID': ["100", "100", "100","100","200","200","200","300"], 
     'Effective Date': ["2016-01-01","2015-06-05","2014-07-01","2013-01-01","2016-01-01","2015-01-01","2013-01-01","2014-01"], 
     'Desired Output': ["Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 1"]} 
df = pd.DataFrame(data, columns=['Employee ID','Effective Date','Desired Output']) 

Some code that I have tried

df['Effective Date'] = df['Effective Date'].astype('datetime64[ns]') 
df['EmplidShift'] = df['Employee ID'].shift(-1) 
df['Effdt-Shift'] = df['Effective Date'].shift(-1) 
df['Prior Row in Same Emplid Class'] = "No" 
df['Effdt Diff'] = df['Effdt-Shift'] - df['Effective Date'] 
df['Effdt Diff'] = (pd.to_timedelta(df['Effdt Diff'], unit='d') + pd.to_timedelta(1,unit='s')).astype('timedelta64[D]') 
df['Cumul. Count'] = df.groupby('Employee ID').cumcount() 


df['Groupby'] = df.groupby('Employee ID')['Cumul. Count'].transform('max') 
df['First Row Appears?'] = "" 
df.loc[df['Cumul. Count'] == df['Groupby'], 'First Row Appears?'] = "First Row" 
df.loc[df['Employee ID'] == df['EmplidShift'], 'Prior Row in Same Emplid Class'] = "Yes" 

df['Effdt > 1 Yr?'] = "" 
df.loc[(df['Prior Row in Same Emplid Class'] == "Yes") & (df['Effdt Diff'] < -365), 'Effdt > 1 Yr?'] = "Yes" 

df['Unique Leave Event'] = "" 
df.loc[(df['Effdt > 1 Yr?'] == "Yes") | (df['First Row Appears?'] == "First Row"), 'Unique Leave Event'] = "Unique Leave Event" 

df 

Answers

2

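You can do this without looping or iterating through your dataframe. Per Wes McKinney, you can use .apply() with a groupby object and define a function that is applied to each group. If you combine this with .shift() (like here), you can get your result without using any loops.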
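To see what `.shift(1)` contributes on its own, here is a minimal sketch on the dates of a single employee (the values are employee 100 from the question, already sorted in ascending order). The variable names `gap`, `new_event` and `event_number` are made up for this illustration and are not part of the answer.

```python
import pandas as pd

# Dates of one employee, sorted ascending (taken from the question's sample data)
dates = pd.Series(pd.to_datetime(["2013-01-01", "2014-07-01", "2015-06-05", "2016-01-01"]))

# shift(1) moves every value down one row, so each row is compared with the row before it.
# The first row has no predecessor, so its gap is NaT.
gap = dates - dates.shift(1)

# A gap of more than 365 days starts a new event; comparing NaT yields False
new_event = gap > pd.Timedelta("365 days")

# The running total of the flags, plus one, numbers the events starting at 1
event_number = new_event.astype(int).cumsum() + 1

print(event_number.tolist())
```

Only the second date lies more than 365 days after its predecessor, so the counter steps up once and then stays constant.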

Terse example:

# Group by Employee ID 
grouped = df.groupby("Employee ID") 
# Define function 
def get_unique_events(group): 
    # Convert to datetime and sort by date, like @Khris did (without modifying the group itself) 
    dates = pd.to_datetime(group["Effective Date"]).sort_values() 
    # Flag gaps of more than 365 days to the previous row, then number the events from 1 
    event_series = (dates - dates.shift(1) > pd.Timedelta('365 days')).astype(int).cumsum() + 1 
    return event_series 

event_df = pd.DataFrame(grouped.apply(get_unique_events).rename("Unique Event")).reset_index(level=0) 
df = pd.merge(df, event_df[['Unique Event']], left_index=True, right_index=True) 
df['Output'] = df['Unique Event'].apply(lambda x: "Unique Leave Event " + str(x)) 
df['Match'] = df['Desired Output'] == df['Output'] 

print(df) 

Output:

  Employee ID Effective Date        Desired Output  Unique Event                Output  Match
3         100     2013-01-01  Unique Leave Event 1             1  Unique Leave Event 1   True
2         100     2014-07-01  Unique Leave Event 2             2  Unique Leave Event 2   True
1         100     2015-06-05  Unique Leave Event 2             2  Unique Leave Event 2   True
0         100     2016-01-01  Unique Leave Event 2             2  Unique Leave Event 2   True
6         200     2013-01-01  Unique Leave Event 1             1  Unique Leave Event 1   True
5         200     2015-01-01  Unique Leave Event 2             2  Unique Leave Event 2   True
4         200     2016-01-01  Unique Leave Event 2             2  Unique Leave Event 2   True
7         300        2014-01  Unique Leave Event 1             1  Unique Leave Event 1   True

A more detailed example for clarity:

import pandas as pd 

data = {'Employee ID': ["100", "100", "100","100","200","200","200","300"], 
     'Effective Date': ["2016-01-01","2015-06-05","2014-07-01","2013-01-01","2016-01-01","2015-01-01","2013-01-01","2014-01"], 
     'Desired Output': ["Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 1"]} 
df = pd.DataFrame(data, columns=['Employee ID','Effective Date','Desired Output']) 

# Group by Employee ID 
grouped = df.groupby("Employee ID") 

# Define a function to get the unique events 
def get_unique_events(group): 
    # Convert to datetime and sort by date, like @Khris did (without modifying the group itself) 
    dates = pd.to_datetime(group["Effective Date"]).sort_values() 
    # Define a series of booleans to determine whether the time between dates is over 365 days 
    # Use .shift(1) to look back one row 
    is_year = dates - dates.shift(1) > pd.Timedelta('365 days') 
    # Convert booleans to integers (0 for False, 1 for True) 
    is_year_int = is_year.astype(int) 
    # Use the cumulative sum function in pandas to get the cumulative adjustment from the first date. 
    # Add one to start the first event as 1 instead of 0 
    event_series = is_year_int.cumsum() + 1 
    return event_series 

# Run function on df and put results into a new dataframe 
# Convert Employee ID back from an index to a column with .reset_index(level=0) 
event_df = pd.DataFrame(grouped.apply(get_unique_events).rename("Unique Event")).reset_index(level=0) 

# Merge the dataframes 
df = pd.merge(df, event_df[['Unique Event']], left_index=True, right_index=True) 

# Add string to match desired format 
df['Output'] = df['Unique Event'].apply(lambda x: "Unique Leave Event " + str(x)) 

# Check to see if output matches desired output 
df['Match'] = df['Desired Output'] == df['Output'] 

print(df) 

You get the same output:

  Employee ID Effective Date        Desired Output  Unique Event                Output  Match
3         100     2013-01-01  Unique Leave Event 1             1  Unique Leave Event 1   True
2         100     2014-07-01  Unique Leave Event 2             2  Unique Leave Event 2   True
1         100     2015-06-05  Unique Leave Event 2             2  Unique Leave Event 2   True
0         100     2016-01-01  Unique Leave Event 2             2  Unique Leave Event 2   True
6         200     2013-01-01  Unique Leave Event 1             1  Unique Leave Event 1   True
5         200     2015-01-01  Unique Leave Event 2             2  Unique Leave Event 2   True
4         200     2016-01-01  Unique Leave Event 2             2  Unique Leave Event 2   True
7         300        2014-01  Unique Leave Event 1             1  Unique Leave Event 1   True
+0

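This is an elegant solution. The only danger might lie in the merge if OP is working with really huge dataframes, but judging from the data contents that is unlikely. – Khris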
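If the merge ever did become a concern, one possible variant is to compute the event numbers directly on the original frame. The following is only a sketch, not part of either answer: it assumes the dates are full `YYYY-MM-DD` strings (the last date is written as "2014-01-01" here, as in the other answer) and it replaces `.apply()` plus `merge` with `groupby(...).diff()` and a grouped `cumsum()`.

```python
import pandas as pd

# Sample data from the question; the last date is written out in full (assumption)
df = pd.DataFrame({
    "Employee ID": ["100", "100", "100", "100", "200", "200", "200", "300"],
    "Effective Date": pd.to_datetime([
        "2016-01-01", "2015-06-05", "2014-07-01", "2013-01-01",
        "2016-01-01", "2015-01-01", "2013-01-01", "2014-01-01",
    ]),
})

# Sort so that each employee's dates are ascending; the original index labels are kept
df = df.sort_values(["Employee ID", "Effective Date"])

# Gap to the previous row of the same employee (NaT for each employee's first row)
gap = df.groupby("Employee ID")["Effective Date"].diff()

# A gap of more than 365 days starts a new event
is_new_event = gap > pd.Timedelta("365 days")

# Running count of new events per employee, starting at 1
df["Unique Event"] = is_new_event.astype(int).groupby(df["Employee ID"]).cumsum() + 1
df["Output"] = "Unique Leave Event " + df["Unique Event"].astype(str)

# Restore the original row order
df = df.sort_index()
print(df)
```

Because the results are assigned as new columns and pandas aligns them on the index, no second dataframe and no merge are needed.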

3

This is a bit clunky, but it produces the correct output, at least for your small example:

import pandas as pd 

data = {'Employee ID': ["100", "100", "100","100","200","200","200","300"], 
     'Effective Date': ["2016-01-01","2015-06-05","2014-07-01","2013-01-01","2016-01-01","2015-01-01","2013-01-01","2014-01-01"], 
     'Desired Output': ["Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 1"]} 
df = pd.DataFrame(data, columns=['Employee ID','Effective Date','Desired Output']) 

df["Effective Date"] = pd.to_datetime(df["Effective Date"]) 
df = df.sort_values(["Employee ID","Effective Date"]).reset_index(drop=True) 

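# The first row always starts the first event 
df.loc[0,"Result"] = "Unique Leave Event 1" 
for i,_ in df.iterrows(): 
    # Guard: row i+1 does not exist on the last iteration 
    if i < len(df)-1: 
        if df.loc[i+1,"Employee ID"] == df.loc[i,"Employee ID"]: 
            if df.loc[i+1,"Effective Date"] - df.loc[i,"Effective Date"] > pd.Timedelta('365 days'): 
                # More than 365 days apart: increase the event number by one 
                df.loc[i+1,"Result"] = "Unique Leave Event " + str(int(df.loc[i,"Result"].split()[-1])+1) 
            else: 
                # Same event: copy the label from the previous row 
                df.loc[i+1,"Result"] = df.loc[i,"Result"] 
        else: 
            # New employee: start over at event 1 
            df.loc[i+1,"Result"] = "Unique Leave Event 1" 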

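Note that this code assumes the first row always contains the string Unique Leave Event 1.

EDIT: Some explanation.

First I convert the dates to datetime format and then reorder the dataframe so that the dates for each Employee ID are ascending.

Then I iterate over the rows of the frame using the built-in iterator iterrows. The _ in for i,_ is merely a placeholder for the second variable, which I don't use: the iterator returns both the row's index (here the row number, because the index was reset) and the row itself, and I only need the number.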
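As a small illustration of what `iterrows` hands back, the sketch below uses a made-up two-row frame with a deliberately non-default index (the column name `Days` and the labels 10 and 20 are assumptions for the example). It shows that the first element of each pair is the index label, which only coincides with the row number after `reset_index(drop=True)`.

```python
import pandas as pd

# Hypothetical frame with index labels that differ from the row positions
frame = pd.DataFrame({"Employee ID": ["100", "200"], "Days": [3, 5]}, index=[10, 20])

# iterrows yields (index label, row as a Series) pairs
labels = [label for label, _ in frame.iterrows()]
days = [row["Days"] for _, row in frame.iterrows()]

# After resetting the index, the labels are simply 0, 1, 2, ...
reset_labels = [label for label, _ in frame.reset_index(drop=True).iterrows()]

print(labels, days, reset_labels)
```

This is why the answer resets the index after sorting: the loop relies on `i+1` being the label of the next row.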

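In the loop I'm doing a row-wise comparison, so I fill the first row manually by default and then always assign to row i+1. I do it this way because I know the value of the first row but not the value of the last row. I then compare row i+1 with row i inside an if guard, because i+1 would give an index error on the last iteration.

In the loop I first check whether the Employee ID has changed between the two rows. If it has not, I compare the dates of the two rows and see whether they are more than 365 days apart. If they are, I read the string "Unique Leave Event X" from row i, increase the number by one and write it to row i+1. If the dates are closer together, I just copy the string from the previous row.

If the Employee ID does change, on the other hand, I just write "Unique Leave Event 1" to start over.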
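The label increment described above can be looked at in isolation. The helper name `next_label` is hypothetical and only wraps the expression used in the loop: split the label on whitespace, take the last piece as the number, add one, and rebuild the string.

```python
def next_label(label: str) -> str:
    """Return the label of the following event, e.g. '... Event 1' -> '... Event 2'."""
    # The event number is the last whitespace-separated token of the label
    number = int(label.split()[-1])
    return "Unique Leave Event " + str(number + 1)


print(next_label("Unique Leave Event 1"))
print(next_label("Unique Leave Event 9"))
```

Parsing the number back out of the string works, but it ties the logic to the label format; keeping the counter as an integer and formatting it at the end would avoid that.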

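Note 1: iterrows() has no options to set, so I can't iterate over just a subset.
Note 2: Always iterate using one of the built-in iterators, and only iterate at all if you can't solve the problem otherwise.
Note 3: When assigning values inside an iteration, always use one of the indexers ix, loc or iloc. (ix was removed in pandas 1.0, so prefer loc or iloc.)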
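A short sketch of why Note 3 matters, using a made-up two-row frame (the column name `Count` is an assumption for the example): the row that `iterrows` yields for a frame with mixed column types is a copy, so changing it leaves the frame untouched, whereas assigning through `loc` writes to the frame itself.

```python
import pandas as pd

# Hypothetical frame with mixed column types (strings and integers)
frame = pd.DataFrame({"Employee ID": ["100", "200"], "Count": [1, 2]})

# Changing the yielded row only changes a temporary copy
for i, row in frame.iterrows():
    row["Count"] = 99
unchanged = frame["Count"].tolist()

# Assigning through .loc with the row label and column name updates the frame
for i, _ in frame.iterrows():
    frame.loc[i, "Count"] = frame.loc[i, "Count"] * 10
updated = frame["Count"].tolist()

print(unchanged, updated)
```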

+0

Thanks! Could you add some comments on how you did it? – Christopher

+0

Hi, sorry for the long wait. I only comment here from work, and we had a three-day weekend. I'll add some comments now. – Khris