2015-10-18

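In this example we have two days of data sampled at 1-minute intervals, which gives 2880 measurements. The measurements were collected one after another in different time zones: the first 240 minutes in 'Europe/London' and the remaining 2640 measurements in 'America/Los_Angeles'. The goal is to compute the mean sales per minute over a 24-hour cycle, based on local time (HH:MM).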

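import pandas as pd 
import numpy as np 
df=pd.DataFrame(index=pd.DatetimeIndex(pd.date_range('2015-03-29 00:00','2015-03-30 23:59',freq='1min',tz='UTC'))) 
df.loc['2015-03-29 00:00':'2015-03-29 04:00','timezone']='Europe/London' 
# The 04:00 row is overwritten here, leaving 240 London rows and 2640 Los Angeles rows
df.loc['2015-03-29 04:00':'2015-03-30 23:59','timezone']='America/Los_Angeles' 
# randint excludes its upper bound, so these draw from 1..100 and 1..10,
# the same ranges as the deprecated np.random.random_integers(100) and (10)
df['sales1']=np.random.randint(1,101,size=len(df)) 
df['sales2']=np.random.randint(1,11,size=len(df)) 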

To compute the mean sales per minute over a 24-hour cycle across multiple days (based on UTC time), the following approach works well:

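# Select the sales columns explicitly: newer pandas versions may raise when asked for the mean of the string column 'timezone'
utc_sales=df[['sales1','sales2']].groupby([df.index.hour,df.index.minute]).mean() 
utc_sales.set_index(pd.date_range("00:00","23:59", freq="1min").time,inplace=True) 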
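
As a compact variant of the same UTC grouping, one can group directly on `df.index.time`, which labels each bucket with a `datetime.time` and makes the separate `set_index` step unnecessary. This is only a minimal sketch: the frame `demo` and its `sales` column are placeholders invented for illustration (a simple counter rather than the random data above), chosen so the result can be checked by hand.

```python
import datetime

import numpy as np
import pandas as pd

idx = pd.date_range('2015-03-29 00:00', '2015-03-30 23:59', freq='1min', tz='UTC')

# Placeholder data (an assumption for illustration): the sample number 0..2879.
demo = pd.DataFrame({'sales': np.arange(len(idx), dtype='float64')}, index=idx)

# DatetimeIndex.time returns the wall-clock time of each timestamp in the
# index's own zone (UTC here), so every minute of the day becomes one group.
utc_profile = demo.groupby(demo.index.time).mean()

# The first bucket is 00:00 and averages sample 0 (day one) and sample 1440 (day two).
print(utc_profile.index[0], utc_profile['sales'].iloc[0])
```

Because the grouping keys are already `datetime.time` objects, the result can be compared directly with a frame indexed the way `utc_sales` is above.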

The same groupby approach can also be used to compute the mean sales based on one of the two time zones, for example 'Europe/London'.

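# Local London timestamp for every row
df['London']=df.index.tz_convert('Europe/London') 
# Group only the sales columns by the London hour and minute
london_sales=df[['sales1','sales2']].groupby([df['London'].dt.hour,df['London'].dt.minute]).mean() 
london_sales.set_index(pd.date_range("00:00","23:59", freq="1min").time,inplace=True) 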
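
One thing that may be worth checking when grouping by a converted time zone is how many samples land in each bucket. The sample window starts on 2015-03-29, the day the UK switched to summer time, so local hours are not covered evenly. The sketch below only counts rows per London hour; the names `london` and `counts` are introduced here for illustration and are not part of the original code.

```python
import pandas as pd

idx = pd.date_range('2015-03-29 00:00', '2015-03-30 23:59', freq='1min', tz='UTC')
london = idx.tz_convert('Europe/London')

# One row per minute; summing the ones per local hour gives the sample count.
counts = pd.Series(1, index=idx).groupby(london.hour).sum()

# Hour 1 is skipped on 2015-03-29 (clocks jump from 01:00 to 02:00),
# while hour 0 is reached three times because the UTC window spills into 2015-03-31.
print(counts.head(3))
```

A per-minute mean over such a window therefore rests on a different number of observations depending on the local hour.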

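However, I am struggling to find an efficient way to compute the mean sales per minute over a 24-hour cycle based on local time. I tried the same approach as above, but when a single Series contains multiple time zones, groupby falls back to the UTC index.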

def calculate_localtime(x): 
    return pd.to_datetime(x.name,unit='s').tz_convert(x['timezone']) 
df['localtime']=df.apply(calculate_localtime,axis=1) 
local_sales=df.groupby([df['localtime'].dt.hour,df['localtime'].dt.minute]).mean() 
local_sales.set_index(pd.date_range("00:00","23:59",freq="1min").time,inplace=True) 

We can verify that local_sales is identical to utc_sales, so this approach does not work.

In [8]: np.unique(local_sales == utc_sales) 
Out[8]: array([ True], dtype=bool) 

Can anyone recommend an approach that works for large datasets and multiple time zones?

Answers

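Here is one way to get what I think you want. It requires pandas 0.17.0 or newer.

Create the data as you have above, with the time zone recorded for each row.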

import pandas as pd 
import numpy as np 

pd.options.display.max_rows=12 
np.random.seed(1234) 
df=pd.DataFrame(index=pd.DatetimeIndex(pd.date_range('2015-03-29 00:00','2015-03-30 23:59',freq='1min',tz='UTC'))) 
df.loc['2015-03-29 00:00':'2015-03-29 04:00','timezone']='Europe/London' 
df.loc['2015-03-29 04:00':'2015-03-30 23:59','timezone']='America/Los_Angeles' 
# randint(1,101) and randint(1,11) cover the same inclusive ranges as the
# deprecated random_integers(100) and random_integers(10)
df['sales1']=np.random.randint(1,101,size=len(df)) 
df['sales2']=np.random.randint(1,11,size=len(df)) 

In [79]: df 
Out[79]: 
                                      timezone  sales1  sales2
2015-03-29 00:00:00+00:00        Europe/London      48       6
2015-03-29 00:01:00+00:00        Europe/London      84       1
2015-03-29 00:02:00+00:00        Europe/London      39       1
2015-03-29 00:03:00+00:00        Europe/London      54      10
2015-03-29 00:04:00+00:00        Europe/London      77       5
2015-03-29 00:05:00+00:00        Europe/London      25       9
...                                        ...     ...     ...
2015-03-30 23:54:00+00:00  America/Los_Angeles      77       8
2015-03-30 23:55:00+00:00  America/Los_Angeles      16       4
2015-03-30 23:56:00+00:00  America/Los_Angeles      55       3
2015-03-30 23:57:00+00:00  America/Los_Angeles      18       1
2015-03-30 23:58:00+00:00  America/Los_Angeles       3       2
2015-03-30 23:59:00+00:00  America/Los_Angeles      52       2

[2880 rows x 3 columns] 

Pivot the data; this creates MultiIndex columns in which each time zone is separated out.

# Rows stay keyed by the UTC timestamp; columns become (timezone, sales) pairs
x = pd.pivot_table(df.reset_index(),values=['sales1','sales2'],index='index',columns='timezone').swaplevel(0,1,axis=1) 
x.columns.names = ['timezone','sales'] 

In [82]: x 
Out[82]: 
timezone                  America/Los_Angeles Europe/London America/Los_Angeles Europe/London
sales                                  sales1        sales1              sales2        sales2
index
2015-03-29 00:00:00+00:00                 NaN            48                 NaN             6
2015-03-29 00:01:00+00:00                 NaN            84                 NaN             1
2015-03-29 00:02:00+00:00                 NaN            39                 NaN             1
2015-03-29 00:03:00+00:00                 NaN            54                 NaN            10
2015-03-29 00:04:00+00:00                 NaN            77                 NaN             5
2015-03-29 00:05:00+00:00                 NaN            25                 NaN             9
...                                       ...           ...                 ...           ...
2015-03-30 23:54:00+00:00                  77           NaN                   8           NaN
2015-03-30 23:55:00+00:00                  16           NaN                   4           NaN
2015-03-30 23:56:00+00:00                  55           NaN                   3           NaN
2015-03-30 23:57:00+00:00                  18           NaN                   1           NaN
2015-03-30 23:58:00+00:00                   3           NaN                   2           NaN
2015-03-30 23:59:00+00:00                  52           NaN                   2           NaN

[2880 rows x 4 columns] 
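
To see what the pivot does on something small enough to read at a glance, here is a minimal sketch on a four-row frame. The frame `toy` and the result name `wide` are made up for illustration, and the sales values are simply copied from rows of the table above; only the `pivot_table` / `swaplevel` call mirrors the one used in this answer.

```python
import pandas as pd

# Four consecutive minutes: two recorded in London, two in Los Angeles.
idx = pd.date_range('2015-03-29 03:58', periods=4, freq='1min', tz='UTC')
toy = pd.DataFrame(
    {
        'timezone': ['Europe/London', 'Europe/London',
                     'America/Los_Angeles', 'America/Los_Angeles'],
        'sales1': [48, 84, 77, 16],
        'sales2': [6, 1, 8, 4],
    },
    index=idx,
).rename_axis('index')

# Same call shape as above: one row per UTC timestamp, one column per
# (timezone, sales) pair after swapping the two column levels.
wide = pd.pivot_table(
    toy.reset_index(),
    values=['sales1', 'sales2'],
    index='index',
    columns='timezone',
).swaplevel(0, 1, axis=1)
wide.columns.names = ['timezone', 'sales']

print(wide)
# A zone's columns are non-null only on the rows recorded in that zone.
print(wide[('Europe/London', 'sales1')].notnull().tolist())
```

The non-null pattern in each zone's columns is what the mask in the next step relies on.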

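Create the groupers we want to use, namely the hour and minute in the local zone. We populate them according to a mask: in other words, where sales1 and sales2 are both not null for a zone, we use the hour/minute of that (local) zone.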

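# Explicit float dtype so the groupers start out as all-NaN numeric Series
hours = pd.Series(index=x.index, dtype='float64') 
minutes = pd.Series(index=x.index, dtype='float64') 
for tz in ['America/Los_Angeles', 'Europe/London' ]: 

    # The UTC index expressed in this zone's local time
    local = df.index.tz_convert(tz) 
    x[(tz,'tz')] = local 

    # Rows recorded in this zone are the ones with non-null sales
    mask = x[(tz,'sales1')].notnull() & x[(tz,'sales2')].notnull() 
    hours.iloc[mask.values] = local.hour[mask.values] 
    minutes.iloc[mask.values] = local.minute[mask.values] 

# sort_index replaces the older sortlevel, which was deprecated in later pandas versions
x = x.sort_index(axis=1) 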

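After this step, x looks as follows. (Note that this could be simplified a bit: we do not actually need to store the local timestamps, we only need to compute the hours/minutes.)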

Out[84]: 
timezone                  America/Los_Angeles                                     Europe/London
sales                     sales1 sales2                        tz sales1 sales2                        tz
index
2015-03-29 00:00:00+00:00    NaN    NaN 2015-03-28 17:00:00-07:00     48      6 2015-03-29 00:00:00+00:00
2015-03-29 00:01:00+00:00    NaN    NaN 2015-03-28 17:01:00-07:00     84      1 2015-03-29 00:01:00+00:00
2015-03-29 00:02:00+00:00    NaN    NaN 2015-03-28 17:02:00-07:00     39      1 2015-03-29 00:02:00+00:00
2015-03-29 00:03:00+00:00    NaN    NaN 2015-03-28 17:03:00-07:00     54     10 2015-03-29 00:03:00+00:00
2015-03-29 00:04:00+00:00    NaN    NaN 2015-03-28 17:04:00-07:00     77      5 2015-03-29 00:04:00+00:00
2015-03-29 00:05:00+00:00    NaN    NaN 2015-03-28 17:05:00-07:00     25      9 2015-03-29 00:05:00+00:00
...                          ...    ...                       ...    ...    ...                       ...
2015-03-30 23:54:00+00:00     77      8 2015-03-30 16:54:00-07:00    NaN    NaN 2015-03-31 00:54:00+01:00
2015-03-30 23:55:00+00:00     16      4 2015-03-30 16:55:00-07:00    NaN    NaN 2015-03-31 00:55:00+01:00
2015-03-30 23:56:00+00:00     55      3 2015-03-30 16:56:00-07:00    NaN    NaN 2015-03-31 00:56:00+01:00
2015-03-30 23:57:00+00:00     18      1 2015-03-30 16:57:00-07:00    NaN    NaN 2015-03-31 00:57:00+01:00
2015-03-30 23:58:00+00:00      3      2 2015-03-30 16:58:00-07:00    NaN    NaN 2015-03-31 00:58:00+01:00
2015-03-30 23:59:00+00:00     52      2 2015-03-30 16:59:00-07:00    NaN    NaN 2015-03-31 00:59:00+01:00

[2880 rows x 6 columns] 
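
The simplification mentioned in the note can be sketched roughly as follows: skip the pivot and the stored `tz` columns, and fill the local hour and minute zone by zone straight from the `timezone` column. This is a sketch under assumptions, not the code used in this answer: the `hour` and `minute` arrays, the `np.where` construction of `timezone`, and the `RandomState` used for the data are choices made here for illustration, and the result is a single pair of sales columns rather than one pair per zone.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2015-03-29 00:00', '2015-03-30 23:59', freq='1min', tz='UTC')
cutoff = pd.Timestamp('2015-03-29 04:00', tz='UTC')

rng = np.random.RandomState(1234)  # illustrative data, not the table above
df = pd.DataFrame({
    'timezone': np.where(idx < cutoff, 'Europe/London', 'America/Los_Angeles'),
    'sales1': rng.randint(1, 101, size=len(idx)),
    'sales2': rng.randint(1, 11, size=len(idx)),
}, index=idx)

# Local hour/minute per row, filled one zone at a time so each
# tz_convert call is vectorised over all rows of that zone.
hour = np.empty(len(df), dtype='int64')
minute = np.empty(len(df), dtype='int64')
for tz in df['timezone'].unique():
    mask = (df['timezone'] == tz).values
    local = df.index[mask].tz_convert(tz)
    hour[mask] = local.hour
    minute[mask] = local.minute

# Mean sales per local minute of the day, pooled across zones.
local_sales = df[['sales1', 'sales2']].groupby([hour, minute]).mean()
local_sales.index.names = ['hour', 'minute']
print(local_sales.head())
```

The loop runs once per distinct zone rather than once per row, which is the property that should matter for large datasets; whether pooling the zones into one profile is appropriate depends on what the averages are meant to represent.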

This uses the new representation of time zones (introduced in 0.17.0).

In [85]: x.dtypes 
Out[85]: 
timezone             sales
America/Los_Angeles  sales1                                float64
                     sales2                                float64
                     tz        datetime64[ns, America/Los_Angeles]
Europe/London        sales1                                float64
                     sales2                                float64
                     tz              datetime64[ns, Europe/London]
dtype: object

Result:

x.groupby([hours,minutes]).mean() 

timezone America/Los_Angeles Europe/London
sales                 sales1 sales2 sales1 sales2
0  0                    62.5    5.5     48      6
   1                    52.0    7.0     84      1
   2                    89.0    3.5     39      1
   3                    67.5    6.5     54     10
   4                    41.0    5.5     77      5
   5                    81.0    5.5     25      9
...                      ...    ...    ...    ...
23 54                   76.5    4.5    NaN    NaN
   55                   37.5    5.0    NaN    NaN
   56                   60.5    8.0    NaN    NaN
   57                   87.5    7.0    NaN    NaN
   58                   77.5    6.0    NaN    NaN
   59                   31.0    5.5    NaN    NaN

[1440 rows x 4 columns]