2016-03-04 109 views

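Iterating over a Pandas DataFrame, applying conditions and adding a column

I have purchase data and want to label each record with a new column that describes the time of day of the purchase. To do this, I use the hour from each purchase's timestamp column.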

The labels should work like this:

hour 4 - 7 => 'morning' 
hour 8 - 11 => 'before midday' 
... 

I have already extracted the hour from the timestamp. I now have a DataFrame with 50 million records that looks like this.

   user_id            timestamp  hour
0       11  2015-08-21 06:42:44     6
1       11  2015-08-20 13:38:58    13
2       11  2015-08-20 13:37:47    13
3       11  2015-08-21 06:59:05     6
4       11  2015-08-20 13:15:21    13

My current approach is to call .iterrows() six times, once for each condition:

for index, row in basket_times[(basket_times['hour'] >= 4) & (basket_times['hour'] < 8)].iterrows(): 
    basket_times['periode'] = 'morning' 

Then:

for index, row in basket_times[(basket_times['hour'] >= 8) & (basket_times['hour'] < 12)].iterrows(): 
    basket_times['periode'] = 'before midday' 

and so on.

However, just one of the six loops over the 50 million records already takes about an hour. Is there a better way?

Answers

1

You can define a function that maps each hour range to the string you want, and then use map:

def get_periode(hour):
    # Hours outside both ranges fall through and return None.
    if 4 <= hour <= 7:
        return 'morning'
    elif 8 <= hour <= 11:
        return 'before midday'

basket_times['periode'] = basket_times['hour'].map(get_periode)
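
As a variation on the same idea, Series.map also accepts a dict, so the hour-to-label mapping can be written as a lookup table instead of a function. The following is only a minimal sketch: the name period_by_hour and the four-row sample frame are made up for illustration, and only the two ranges spelled out in the question are filled in, so every other hour ends up as NaN.

```python
import pandas as pd

# Lookup table built from the two ranges given in the question;
# hours outside 4-11 are deliberately left out and become NaN.
period_by_hour = {h: 'morning' for h in range(4, 8)}
period_by_hour.update({h: 'before midday' for h in range(8, 12)})

# Small stand-in for the real 50-million-row frame (illustrative data).
basket_times = pd.DataFrame({'hour': [6, 13, 9, 11]})

# map() with a dict looks up each hour; missing keys yield NaN.
basket_times['periode'] = basket_times['hour'].map(period_by_hour)
print(basket_times)
```

Further ranges can be added to the dict once their labels are decided; no change to the map call is needed.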

Works perfectly! I also found out that my own approach did not work at all.

0

You can try using loc with boolean masks. I changed the df for testing:

print(basket_times)
   user_id            timestamp  hour
0       11  2015-08-21 06:42:44     6
1       11  2015-08-20 13:38:58    13
2       11  2015-08-20 09:37:47     9
3       11  2015-08-21 06:59:05     6
4       11  2015-08-20 13:15:21    13

# create boolean masks (the question assigns hours 8 - 11 to 'before midday')
morning = (basket_times['hour'] >= 4) & (basket_times['hour'] < 8)
beforemidday = (basket_times['hour'] >= 8) & (basket_times['hour'] < 12)
aftermidday = (basket_times['hour'] >= 12) & (basket_times['hour'] < 15)
print(morning)
0  True 
1 False 
2 False 
3  True 
4 False 
Name: hour, dtype: bool 

print(beforemidday)
0 False 
1 False 
2  True 
3 False 
4 False 
Name: hour, dtype: bool 

print(aftermidday)
0 False 
1  True 
2 False 
3 False 
4  True 
Name: hour, dtype: bool 

# assign each label to the rows selected by its mask
basket_times.loc[morning, 'periode'] = 'morning'
basket_times.loc[beforemidday, 'periode'] = 'before midday'
basket_times.loc[aftermidday, 'periode'] = 'after midday'
print(basket_times)
   user_id            timestamp  hour        periode
0       11  2015-08-21 06:42:44     6        morning
1       11  2015-08-20 13:38:58    13   after midday
2       11  2015-08-20 09:37:47     9  before midday
3       11  2015-08-21 06:59:05     6        morning
4       11  2015-08-20 13:15:21    13   after midday
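
The separate masks can probably also be collapsed into one binning step with pd.cut. This is only a sketch under stated assumptions, not part of the original answer: the sample frame is invented, and only the two ranges named in the question are used as bins, so hours outside 4-11 stay NaN. Note that pd.cut returns a categorical column rather than plain strings.

```python
import pandas as pd

# Illustrative sample data, not the real purchase records.
basket_times = pd.DataFrame({'hour': [6, 13, 9, 6, 11]})

# Bins [4, 8) and [8, 12): right=False makes each bin include its left
# edge and exclude its right edge, matching "hour 4 - 7" and "hour 8 - 11".
basket_times['periode'] = pd.cut(
    basket_times['hour'],
    bins=[4, 8, 12],
    labels=['morning', 'before midday'],
    right=False,
)
print(basket_times)
```

Adding another period means appending one bin edge and one label, which keeps the boundaries in a single place.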

Timings with len(df) = 500k (a is the boolean-mask version, b is the map version):

In [87]: %timeit a(df) 
10 loops, best of 3: 34 ms per loop 

In [88]: %timeit b(df1) 
1 loops, best of 3: 490 ms per loop 

Code used for testing:

import pandas as pd 
import io 

temp=u"""user_id;timestamp;hour
11;2015-08-21 06:42:44;6
11;2015-08-20 10:38:58;10
11;2015-08-20 09:37:47;9
11;2015-08-21 06:59:05;6
11;2015-08-20 10:15:21;10"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", index_col=None, parse_dates=[1])
# repeat the 5 sample rows 100000 times to get 500k rows
df = pd.concat([df]*100000).reset_index(drop=True)
print(df.shape)
#(500000, 3) 
df1 = df.copy() 

def a(basket_times):
    # boolean masks + loc (hours 8 - 11 count as 'before midday', as in the question)
    morning = (basket_times['hour'] >= 4) & (basket_times['hour'] < 8)
    beforemidday = (basket_times['hour'] >= 8) & (basket_times['hour'] < 12)
    basket_times.loc[morning, 'periode'] = 'morning'
    basket_times.loc[beforemidday, 'periode'] = 'before midday'
    return basket_times

def b(basket_times):
    # element-wise map with a Python function
    def get_periode(hour):
        if 4 <= hour <= 7:
            return 'morning'
        elif 8 <= hour <= 11:
            return 'before midday'

    basket_times['periode'] = basket_times['hour'].map(get_periode)
    return basket_times

print(a(df))
print(b(df1))
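
Outside IPython, where %timeit is not available, a comparison of this kind could be run with the standard timeit module. The sketch below is an assumption-laden stand-in rather than the answerer's benchmark: the function names with_masks and with_map, the random synthetic hours and the smaller row count are all chosen for illustration, and the absolute numbers will differ from the ones quoted above.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic hours, illustrative only; n is kept small so the sketch runs quickly.
n = 100000
rng = np.random.RandomState(0)
hours = pd.DataFrame({'hour': rng.randint(0, 24, size=n)})

def with_masks(df):
    """Label rows with vectorised boolean masks and loc."""
    df = df.copy()  # keep the input frame untouched between runs
    morning = (df['hour'] >= 4) & (df['hour'] < 8)
    before_midday = (df['hour'] >= 8) & (df['hour'] < 12)
    df.loc[morning, 'periode'] = 'morning'
    df.loc[before_midday, 'periode'] = 'before midday'
    return df

def with_map(df):
    """Label rows by calling a Python function once per element."""
    def get_periode(hour):
        if 4 <= hour <= 7:
            return 'morning'
        elif 8 <= hour <= 11:
            return 'before midday'
    df = df.copy()
    df['periode'] = df['hour'].map(get_periode)
    return df

for func in (with_masks, with_map):
    # Best of three single runs, mirroring the "best of 3" in %timeit.
    seconds = min(timeit.repeat(lambda: func(hours), number=1, repeat=3))
    print('{}: {:.1f} ms'.format(func.__name__, seconds * 1000))
```

Both functions copy their input, so the cost of the copy is included in each measurement and the two timings remain comparable.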