2016-01-22 43 views
2

从每天不同时间测量的多年温度记录开始,我想最终得到一个矩形的日平均数列,每行代表一年的数据。对齐日常数据

的数据看起来像这样

temperature.head() 

date 
1996-01-01 00:00:00  7.39 
1996-01-01 03:00:00  6.60 
1996-01-01 06:00:00  7.39 
1996-01-01 09:00:00  9.50 
1996-01-01 12:00:00 11.00 
Name: temperature, dtype: float64 

我计算每日平均值与

import pandas as pd 
daily = temperature.groupby(pd.TimeGrouper(freq='D')).mean() 

其中产量现在

daily.head() 

date 
1996-01-01  9.89625 
1996-01-02 10.73625 
1996-01-03  6.98500 
1996-01-04  5.62250 
1996-01-05  8.84625 
Freq: D, Name: temperature, dtype: float64 

的最后一部分,我想到了什么样

yearly_daily_mean = daily.groupby(pd.TimeGrouper(freq='12M', closed="left")) 

但这里有一些问题。

  1. 我需要删除未填满一年的数据的尾部。
  2. 如果缺少数据会怎么样?
  3. 如何应对闰年?
  4. 下一步是什么?也就是说,如何“堆叠”(以numpy的方式,而不是熊猫的感觉)多年的数据?

我使用

array_temperature = np.column_stack([group[1] for group in yearly_daily_mean if len(group[1]) == 365]) 

但应该有一个更好的办法。

作为一个附属问题,我该如何选择多年数据的开始日期?

+1

你可以添加一个如何得到想要的结果的例子吗? (理想情况下,一些示例数据给出了一个可重复的例子) – joris

回答

4

如果我正确理解了你的意思,你需要将你的时间序列重新组合成一个矩形的数据框,其中不同日期为列,不同年份为行。
这可以方便地与大熊猫重塑功能,例如与pivot来实现:

一些假数据:

In [45]: index = pd.date_range(start=date(1996, 1,1), end=date(2010, 6, 30), freq='D') 

In [46]: daily = pd.DataFrame(index=index, data=np.random.random(size=len(index)), columns=['temperature']) 

首先,我增加一列与去年的年和日:

In [47]: daily['year'] = daily.index.year 

In [48]: daily['day'] = daily.index.dayofyear 

In [49]: daily.head() 
Out[49]: 
      temperature year day 
1996-01-01  0.081774 1996 1 
1996-01-02  0.694968 1996 2 
1996-01-03  0.478050 1996 3 
1996-01-04  0.123844 1996 4 
1996-01-05  0.426150 1996 5 

现在,我们可以重塑这个数据帧:

In [50]: daily.pivot(index='year', columns='day', values='temperature') 
Out[50]: 
day  1   2  ...   365  366 
year      ... 
1996 0.081774 0.694968 ...  0.679461 0.700833 
1997 0.043134 0.981707 ...  0.009357  NaN 
1998 0.257077 0.297290 ...  0.701941  NaN 
...  ...  ... ...   ...  ... 
2008 0.047145 0.750354 ...  0.996396 0.761159 
2009 0.348667 0.827057 ...  0.881424  NaN 
2010 0.269743 0.872655 ...   NaN  NaN 

[15 rows x 366 columns] 
1

从高频数据假设你已经有每日平均值(与pd.DateTimeIndex)作为的结果:IIUC

daily = temperature.groupby(pd.TimeGrouper(freq='D')).mean() 

,要日均用相等数量的列转换成DataFrame每行捕获年度数据。瞄准相同数量的列时,你提到闰年是一个潜在的问题。

我能想象的要对此有两种方式:

  1. 选择一个号码每row天的 - 的连续365个每天的数据点,每个row大概365选择滚动块和index为每对准这些块。
  2. 选择多年的数据,填补闰年的空白,并根据MM-DD或一年中的日期数来调整。

与20年半的日常随机data为模拟日平均气温开始:

index = pd.date_range(start=date(1995, 1,1), end=date(2015, 6, 30), freq='D') 
df = pd.DataFrame(index=index, data=np.random.random(size=len(index)) * 30, columns=['temperature']) 

df.info() 

<class 'pandas.core.frame.DataFrame'> 
DatetimeIndex: 7486 entries, 1995-01-01 to 2015-06-30 
Freq: D 
Data columns (total 1 columns): 
temperature 7486 non-null float64 
dtypes: float64(1) 
memory usage: 117.0 KB 
None 

df.head() 

      temperature 
1995-01-01  4.119212 
1995-01-02 27.107131 
1995-01-03 26.704931 
1995-01-04  7.430203 
1995-01-05  4.230398 

df.tail() 

      temperature 
2015-06-26 10.902779 
2015-06-27  8.494378 
2015-06-28 17.800131 
2015-06-29 19.543815 
2015-06-30 16.390435 

这里的第一种方法的解决方案:使用.groupby(pd.TimeGrouper('365D'))

连续365天选择块,并将每个得到的groupby每日平均值的对象作为pd.DataFrameintegerindex返回,该值从0364对于每个s层序:

aligned = df.groupby(pd.TimeGrouper(freq='365D')).apply(lambda x: pd.DataFrame(x.squeeze().tolist())) # .squeeze() converts single columns `DataFrame` to pd.Series 

要对齐数据的21块,只需调换pd.DataFrame,他们将通过integer指数在columns, with the start date of each sequence in the指数. This operation will produce an extra指数对准, and the last row`将有一些丢失的数据。清理都是具有:

aligned.dropna().reset_index(-1, drop=True) 

获得[20 x 365]DataFrame如下:

DatetimeIndex: 20 entries, 1995-01-01 to 2013-12-27 
Freq: 365D 
Columns: 365 entries, 0 to 364 
dtypes: float64(365) 
memory usage: 57.2 KB 

        0   1   2   3   4   5 \ 
1995-01-01 29.456090 25.313968 4.146206 5.347690 25.767425 11.978152 
1996-01-01 25.585481 26.846486 8.336905 16.749842 6.247542 17.723733 
1996-12-31 23.410462 10.168599 5.601917 11.996500 8.650726 23.362815 
1997-12-31 7.586873 23.882106 22.145595 3.287160 21.642547 1.949321 
1998-12-31 14.691420 3.611475 28.287327 25.347787 13.291708 20.571616 
1999-12-31 25.713866 17.588570 18.562117 19.420944 12.406293 11.870750 
2000-12-30 5.099561 17.894763 21.168223 4.786461 24.521417 21.443607 
2001-12-30 11.791223 8.352493 12.731769 0.459697 20.680396 27.554783 
2002-12-30 3.785876 0.359850 20.828764 15.376991 14.086626 0.477615 
2003-12-30 23.633243 12.726250 8.197824 16.355956 8.094145 1.410746 
2004-12-29 1.139949 4.161267 9.043062 14.109888 13.538735 1.566002 
2005-12-29 25.504224 19.346419 3.300641 26.933084 23.634321 18.323450 
2006-12-29 10.535785 9.168498 27.222106 11.962343 10.004678 23.893257 
2007-12-29 27.482856 6.910670 6.033291 12.673530 26.362971 4.492178 
2008-12-28 11.152316 25.233664 22.124299 11.012285 1.992814 25.542204 
2009-12-28 23.131021 16.363467 1.242393 10.387653 4.858851 26.553950 
2010-12-28 13.134843 9.195658 19.075850 28.539387 3.075934 8.089347 
2011-12-28 28.860275 10.121573 0.663906 19.687892 29.376377 11.488446 
2012-12-27 7.644073 19.649330 25.497595 6.592940 8.879444 17.733670 
2013-12-27 11.713996 2.602284 3.835302 22.244623 27.279810 14.144943 

        6   7   8   9  ...   355 \ 
1995-01-01 8.210005 8.129146 28.798472 25.646924 ...  24.177163 
1996-01-01 0.481487 16.772357 3.934185 22.640157 ...  23.340931 
1996-12-31 10.813812 16.276504 3.422665 14.916229 ...  13.817015 
1997-12-31 19.184753 28.628326 22.134871 12.721064 ...  23.905483 
1998-12-31 2.839492 7.889141 17.951959 25.233585 ...  28.002751 
1999-12-31 6.958672 26.335427 23.361470 5.911806 ...  7.778412 
2000-12-30 8.405042 25.229016 19.746462 15.332004 ...  5.703830 
2001-12-30 0.558788 15.457327 20.987186 25.452723 ...  29.771372 
2002-12-30 19.002685 26.455754 25.468178 25.383786 ...  14.238987 
2003-12-30 22.984328 15.934398 25.361599 12.221306 ...  1.189949 
2004-12-29 22.121901 21.421103 26.175702 16.040881 ...  19.945408 
2005-12-29 2.557901 15.193412 27.049389 4.825570 ...  7.629859 
2006-12-29 8.582602 26.037375 0.933591 13.469771 ...  29.453932 
2007-12-29 29.437921 26.470153 9.917871 16.875801 ...  5.702116 
2008-12-28 3.809633 10.583385 18.029571 0.440077 ...  11.337894 
2009-12-28 24.406696 28.294553 19.929563 4.683991 ...  25.697446 
2010-12-28 29.765551 16.716723 6.467946 10.998447 ...  26.988863 
2011-12-28 28.962746 11.407137 9.957111 4.502521 ...  14.606937 
2012-12-27 1.374502 5.571244 11.212960 9.949830 ...  23.345868 
2013-12-27 26.373866 4.781510 16.828510 10.280078 ...  0.552726 

        356  357  358  359  360  361 \ 
1995-01-01 13.511951 10.126835 28.121730 23.275360 11.785242 27.907039 
1996-01-01 13.362737 14.336780 24.114908 28.479688 8.509069 17.408937 
1996-12-31 19.192674 1.146844 27.499688 7.090407 2.777819 22.826814 
1997-12-31 21.502186 10.495148 21.786895 12.229181 8.068271 6.522108 
1998-12-31 21.338355 11.978265 9.186161 21.053924 3.033370 29.934703 
1999-12-31 5.960120 20.325684 0.915052 15.059979 12.194240 20.138567 
2000-12-30 11.883186 2.764768 27.324304 29.630706 21.852058 20.416199 
2001-12-30 7.802891 25.384479 9.044486 8.809446 7.606603 6.051890 
2002-12-30 7.362494 8.940783 5.259984 7.035818 24.094134 7.197113 
2003-12-30 25.596902 9.756372 6.345198 1.520188 22.752717 3.470268 
2004-12-29 26.789064 9.708466 18.287838 21.134643 29.862135 19.926086 
2005-12-29 26.398394 24.717514 16.606042 28.189245 24.574806 14.297410 
2006-12-29 8.795342 18.019536 16.579878 20.368811 22.052442 26.393676 
2007-12-29 8.696240 25.901889 16.410934 15.274897 14.365867 10.523388 
2008-12-28 18.581513 25.974784 21.025297 10.521118 5.864974 2.373023 
2009-12-28 14.437944 21.717456 4.017870 14.024522 0.959989 17.215403 
2010-12-28 11.426540 13.751451 4.664761 15.373878 7.731613 7.269089 
2011-12-28 1.952897 9.406866 28.957258 20.239517 11.156958 29.238761 
2012-12-27 7.588643 21.186675 17.348911 1.354323 13.918083 3.034123 
2013-12-27 22.916065 2.089675 22.832061 14.787841 25.697875 14.087893 

        362  363  364 
1995-01-01 13.107523 10.740551 20.511825 
1996-01-01 25.016219 17.885332 2.438875 
1996-12-31 24.692327 0.221760 6.749919 
1997-12-31 24.856169 0.930019 22.603652 
1998-12-31 18.361414 13.587695 25.161495 
1999-12-31 0.512120 26.482288 1.035197 
2000-12-30 15.401012 28.334219 5.965014 
2001-12-30 10.292213 10.951915 8.270319 
2002-12-30 21.945734 27.076438 6.795688 
2003-12-30 14.788929 19.456459 11.216835 
2004-12-29 7.086443 25.463503 17.549196 
2005-12-29 12.252487 29.081547 25.507369 
2006-12-29 0.012617 0.086186 17.421958 
2007-12-29 4.191633 21.588891 7.516187 
2008-12-28 26.194288 20.500256 24.876032 
2009-12-28 28.445254 27.338754 7.849899 
2010-12-28 28.888573 26.801262 23.117027 
2011-12-28 19.871547 20.324514 18.369134 
2012-12-27 15.907752 9.417700 4.922940 
2013-12-27 21.132385 20.707216 5.288128 

[20 rows x 365 columns] 

如果你想简单地收集年的数据,并按照日期排列,从而使非闰年有周围没有60失踪一天(而不是366),您可以:

df.groupby(pd.TimeGrouper(freq='A')).apply(lambda x: pd.DataFrame(x.squeeze().tolist()).T).reset_index(-1, drop=True).iloc 

        0   1   2   3   4   5 \ 
1995-12-31 1.245796 28.487530 0.574299 10.033485 19.221512 8.718728 
1996-12-31 12.258653 3.864652 25.237088 13.982809 24.494746 13.822292 
1997-12-31 22.239412 4.796824 21.389404 11.151171 25.577368 1.754948 
1998-12-31 24.968287 2.089894 25.888487 28.291714 19.115844 24.426285 
1999-12-31 9.285363 19.339405 26.012193 3.243394 25.176499 8.766770 
2000-12-31 26.996573 26.404391 1.793644 21.314488 13.118279 26.703532 
2001-12-31 16.303829 14.021771 20.828238 11.427195 3.099290 18.730795 
2002-12-31 14.614617 10.694258 5.226033 24.900849 17.395822 22.154202 
2003-12-31 10.564132 8.267639 7.778573 26.704936 5.671499 0.470963 
2004-12-31 22.649623 15.725867 18.445629 7.529507 11.868134 10.965534 
2005-12-31 2.406615 9.709624 23.284616 11.479254 23.814725 1.656826 
2006-12-31 19.164459 23.177769 16.091672 28.936777 28.636072 4.838555 
2007-12-31 12.371377 3.417582 21.067689 25.493921 25.410295 15.526614 
2008-12-31 29.080385 4.653984 16.567333 24.248921 27.338538 9.353291 
2009-12-31 29.608734 6.046593 22.738628 22.631714 26.061903 21.217846 
2010-12-31 27.458254 15.146497 18.917073 8.473955 26.782767 10.891648 
2011-12-31 25.433759 8.959650 14.343507 16.249726 17.031174 12.944418 
2012-12-31 22.940797 4.791280 11.765939 25.925645 3.649440 27.483407 
2013-12-31 11.684391 27.701678 27.423083 27.656086 9.374896 14.250936 
2014-12-31 23.660098 27.768960 25.753294 3.014606 23.330226 17.570492 

        6   7   8   9  ...   356 \ 
1995-12-31 17.079137 26.100763 12.376462 12.315219 ...  16.910185 
1996-12-31 26.718277 10.349412 12.940624 9.453769 ...  19.235435 
1997-12-31 20.201528 22.895552 1.443243 20.584140 ...  29.665815 
1998-12-31 21.493163 16.724328 5.946833 15.230762 ...  2.617883 
1999-12-31 9.776013 13.381424 11.028295 1.905501 ...  7.200409 
2000-12-31 9.773097 14.565345 22.578398 0.688273 ...  18.119020 
2001-12-31 1.095308 14.817514 25.652418 8.327481 ...  15.385689 
2002-12-31 29.744794 15.545211 6.373948 13.451261 ...  7.446414 
2003-12-31 14.971959 25.948332 21.596976 5.355589 ...  23.676867 
2004-12-31 0.604113 2.858745 0.120340 19.365223 ...  0.336213 
2005-12-31 6.260722 9.819337 19.573953 11.132919 ...  26.107100 
2006-12-31 10.341241 15.126506 3.349634 23.619127 ...  15.508680 
2007-12-31 20.033540 22.103483 7.674852 1.263726 ...  15.148461 
2008-12-31 28.233973 27.982105 17.037928 5.389418 ...  8.773618 
2009-12-31 4.400039 7.284556 11.825382 4.201001 ...  6.734423 
2010-12-31 26.086305 26.275027 8.069376 19.200344 ...  19.056528 
2011-12-31 29.215028 0.985623 4.813478 7.752540 ...  14.395423 
2012-12-31 4.690336 9.618306 25.492041 10.400292 ...  8.853903 
2013-12-31 8.227096 11.013431 0.996911 15.276574 ...  26.227540 
2014-12-31 23.440591 16.544698 2.263684 3.919315 ...  24.987387 

        357  358  359  360  361  362 \ 
1995-12-31 24.791125 21.443534 21.092439 8.289222 9.745293 20.084046 
1996-12-31 2.632656 2.102163 24.828437 18.104255 7.951859 3.266873 
1997-12-31 11.246534 14.086539 29.635519 19.518642 24.086108 6.041870 
1998-12-31 29.961162 9.924863 9.401790 25.597344 13.885467 16.537406 
1999-12-31 3.057125 15.241720 8.472388 3.248545 11.302522 19.283612 
2000-12-31 22.999729 17.518504 10.058249 2.953903 10.167712 17.309525 
2001-12-31 18.267445 23.205300 25.658591 19.915797 10.704525 26.604965 
2002-12-31 11.497110 3.641206 9.693428 24.571510 6.438652 29.280098 
2003-12-31 23.931401 19.967615 0.307896 0.385782 0.579257 7.534806 
2004-12-31 21.321146 9.224362 1.703842 6.180944 28.173925 5.178336 
2005-12-31 17.990409 28.746179 2.524899 10.555224 25.487723 19.877390 
2006-12-31 9.748760 29.069966 1.717175 3.283069 9.615215 25.787787 
2007-12-31 29.772930 20.892030 16.597493 20.079373 17.320327 9.583089 
2008-12-31 22.787891 26.636413 13.872783 29.305847 21.287553 1.263788 
2009-12-31 1.574188 23.172773 0.967153 1.928999 12.201354 0.125939 
2010-12-31 20.566125 0.429552 4.413156 16.106451 27.745684 18.280928 
2011-12-31 9.348584 2.604338 23.397221 7.378340 16.757224 29.364973 
2012-12-31 4.704570 7.278321 19.034622 24.597784 13.694635 15.912901 
2013-12-31 21.657446 14.110146 23.976991 8.203509 20.083490 4.471119 
2014-12-31 14.465823 9.105391 15.984162 6.796756 8.232619 18.761280 

        363  364  365 
1995-12-31 28.165022 9.735041  NaN 
1996-12-31 11.644543 4.139818 5.420238 
1997-12-31 2.500165 18.290531  NaN 
1998-12-31 23.856333 10.064951  NaN 
1999-12-31 3.090008 26.203395  NaN 
2000-12-31 22.216599 27.942821 0.791318 
2001-12-31 25.682003 4.766435  NaN 
2002-12-31 19.785159 28.972659  NaN 
2003-12-31 15.692168 21.388069  NaN 
2004-12-31 9.079675 7.392328 12.583179 
2005-12-31 18.202333 21.895494  NaN 
2006-12-31 20.951937 26.220226  NaN 
2007-12-31 23.603166 28.165377  NaN 
2008-12-31 20.532933 9.401494 25.296916 
2009-12-31 5.879644 10.377044  NaN 
2010-12-31 0.436284 20.875852  NaN 
2011-12-31 13.205290 6.832805  NaN 
2012-12-31 23.253155 17.760731 23.270751 
2013-12-31 19.807798 2.453238  NaN 
2014-12-31 12.817601 11.756561  NaN 

[20 rows x 366 columns] 
+0

实际上,我想要n_day日平均值的n_year序列,其中n_year是年数,n_day是每年价值数量,最有可能是365. –

+0

这是堆积行365或每天366次的平均值,每年一次? – Stefan

+0

是的,但所有行都应该有相同数量的值。 –

1

这里是我会怎么做。非常简单:用你想要的确切形状创建一个新的df,然后用你想要的东西填充它。

from datetime import datetime 
import numpy as np 
import pandas as pd 

# This is my re-creation of the data you have. (I'm calling it df1.) 
# It's essential that your date-time be in datetime.datetime format, not strings 
byear = 1996 # arbitrary 
eyear = 2005 # arbitrary 
obs_n = 50000 # arbitrary 
start_time = datetime.timestamp(datetime(byear,1,1,0,0,0,0)) 
end_time = datetime.timestamp(datetime(eyear,12,31,23,59,59,999999)) 
obs_times = np.linspace(start_time,end_time,num=obs_n) 
index1 = pd.Index([datetime.fromtimestamp(i) for i in obs_times]) 
df1 = pd.DataFrame(data=np.random.rand(obs_n)*20,index=index1,columns=['temp']) 
# ^some random data 

# Here is the new empty dataframe (df2) where you will put your daily averages. 
index2 = pd.Index(range(byear,eyear+1)) 
columns2 = range(1,367) # change to 366 if you want to assume 365-day years 
df2 = pd.DataFrame(index=index2,columns=columns2) 

# Some quick manipulations that allow the two dfs' indexes to talk to one another. 
df1['year'] = df1.index.year # a new column with the observation's year as an integer 
df1['day'] = df1.index.dayofyear # a new column with the day of the year as integer 
df1 = df1.reset_index().set_index(['year','day']) 

# Now get the averages for each day and assign them to df2. 
for year in index2: 
    for day in columns2[:365]: # for all but the last entry in the range 
     df2.loc[year,day] = df1.loc[(year,day),'temp'].mean() 
    if (year,366) in df1.index: # then if it's a leap year... 
     df2.loc[year,366] = df1.loc[(year,366),'temp'].mean() 

如果你不想最终DF会对那一天366任何空值,那么你可以删除最后的if语句,并重写columns2 = range(1,366),然后DF2将所有非空值(假设在观察时间段内每天至少有一次测量)。