2017-09-24 109 views
1

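Pandas rolling with a one-sided window

I am trying to implement the following with pandas.DataFrame.rolling:

At index i, I want a rolling sum, mean, median, ... using a parzen window over the last size_win values. It is crucial to consider only values from the past (i.e. index < i) and not any values from the future (this is the "what information do we have at time i?" scenario). The second constraint is that I want a one-sided window: the value at index i should get the largest weight, i-1 a smaller weight, and so on down to i-size_win, which gets the smallest weight.

Using the standard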

df.rolling(window=size_win, win_type='parzen').sum()

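does not work for me, because it gives index i the smallest weight and index i-(size_win/2) the largest weight. Supplying the center parameter would give index i the largest weight, but the calculation would then also use future values (index > i).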
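To make the weighting problem concrete, here is a minimal sketch that compares the symmetric Parzen weights with a one-sided variant. The small size_win of 5 is a hypothetical value chosen only so the weights are easy to inspect, and the one-sided construction (the rising half of a longer symmetric window) is the same idea used in the example further down.

```python
import numpy as np
from scipy.signal import windows

size_win = 5  # hypothetical small size, only for readability

# Symmetric Parzen window: the largest weight sits in the middle.
symmetric = windows.parzen(size_win)

# One-sided variant: rising half of a symmetric window of length 2*size_win - 1,
# so the last element (applied to index i) carries the largest weight.
one_sided = windows.parzen(2 * size_win - 1)[:size_win]

print("symmetric:", np.round(symmetric, 3), "peak at", np.argmax(symmetric))
print("one-sided:", np.round(one_sided, 3), "peak at", np.argmax(one_sided))
```

In a non-centered rolling window the last weight is applied to the current row, which is why the symmetric window ends up favouring the row half a window back.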

I found a solution using pandas.DataFrame.rolling(...).apply, but it is (of course) very slow.

See the following example:

import time 

import pandas as pd 
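from scipy.signal import windows  # the window functions live in scipy.signal.windows
import numpy as np 

df = pd.DataFrame(np.random.randint(0,100,size=(100000, 4)), columns=list('ABCD')) 

size_win = 1000 

def window_single_sided_parzen(window_size): 
    # Rising half of a symmetric Parzen window: the last sample gets the peak weight.
    return windows.parzen((window_size-1)*2+1)[0:window_size] 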

def custom_rolling_sum(x, window): 
    return (x * window).sum() 

t_start = time.time() 
df_rolled_fast = df.rolling(window=size_win, win_type='parzen').sum() 
print(f'Run time of builtin: {time.time() - t_start:.2f} s') 

t_start = time.time() 
df_rolled = df.rolling(window=size_win).apply(lambda x: custom_rolling_sum(x, window_single_sided_parzen(size_win))) 
print(f'Run time of apply: {time.time() - t_start:.2f} s') 

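In my case the built-in rolling takes 1.3 s (but does not produce the result I want), while my own solution takes 54 s.

Any ideas on how to solve this more efficiently?

Answers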

1

I found the error in my own reasoning:

df_rolled = df.rolling(window=size_win).apply(lambda x: custom_rolling_sum(x, window_single_sided_parzen(size_win))) 

I naively assumed that the expensive function window_single_sided_parzen(size_win) would be called only once. In fact, it is called for every row. Switching to

win = window_single_sided_parzen(size_win) 
df_rolled = df.rolling(window=size_win).apply(lambda x: custom_rolling_sum(x, win)) 

is much faster. It is not as fast as the built-in function, but fast enough.
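For the weighted sum specifically, the per-window Python call can be avoided altogether. The following is a minimal sketch, not something taken from the answer above: it assumes the data contains no NaN values, and rolling_weighted_sum is a hypothetical helper name. It relies on the fact that a weighted rolling sum is a convolution of the column with the reversed weights, and it checks the result against the apply-based version on a small frame.

```python
import numpy as np
import pandas as pd
from scipy.signal import windows

def one_sided_parzen(window_size):
    # Rising half of a symmetric Parzen window; the last weight is the largest.
    return windows.parzen(2 * window_size - 1)[:window_size]

def rolling_weighted_sum(df, weights):
    """Hypothetical helper: weighted rolling sum where weights[-1] applies to the current row."""
    n = len(weights)
    out = {}
    for col in df.columns:
        x = df[col].to_numpy(dtype=float)
        # Convolving with the reversed weights gives sum_k weights[k] * x[i-n+1+k].
        full = np.convolve(x, weights[::-1], mode="full")[: len(x)]
        full[: n - 1] = np.nan  # incomplete windows, matching rolling()'s default
        out[col] = full
    return pd.DataFrame(out, index=df.index)

# Small random frame so the slow reference stays cheap.
rng = np.random.default_rng(0)
df_small = pd.DataFrame(rng.integers(0, 100, size=(200, 2)), columns=list("AB"))
win = one_sided_parzen(10)

fast = rolling_weighted_sum(df_small, win)
slow = df_small.rolling(window=10).apply(lambda x: (x * win).sum(), raw=True)
print(np.allclose(fast.to_numpy(), slow.to_numpy(), equal_nan=True))
```

This only covers linear statistics such as a weighted sum (and a weighted mean, after dividing by the sum of the weights); a weighted median cannot be expressed as a convolution. Passing raw=True to apply, as in the reference line, hands the function a NumPy array instead of a Series and is usually a cheap speed-up on its own.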

0

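I think this may be a terrible hack... but I had a need similar to yours for a one-sided, history-only rolling mean. I wanted to be able to use the built-in functionality in the normal way, and I think I managed it like this: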

# %% Import Base Packages 
import pandas as pd 
import re 
import numpy as np 
import matplotlib.pyplot as plt 
# end%% 

# %% Import other packages to overwrite 
from pandas.core import window as rwindow 
from pandas.core.dtypes.generic import (ABCSeries,ABCDataFrame) 
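from pandas.core.dtypes.common import is_integer 
# _prep_window below calls this private helper; the import path is assumed
# to match the old pandas version whose internals this hack overrides.
from pandas.core.common import _asarray_tuplesafe 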
# end%% 

# %% Overwrite Functions and methods 
class Window_single_sided(rwindow.Window): 
    def _prep_window(self, **kwargs):
        """
        provide validation for our window type, return the window
        we have already been validated
        """

        window = self._get_window()
        if isinstance(window, (list, tuple, np.ndarray)):
            return _asarray_tuplesafe(window).astype(float)
        elif is_integer(window):
            import scipy.signal as sig

            # the below may pop from kwargs
            def _validate_win_type(win_type, kwargs):
                arg_map = {'kaiser': ['beta'],
                           'gaussian': ['std'],
                           'general_gaussian': ['power', 'width'],
                           'slepian': ['width']}
                if win_type in arg_map:
                    return tuple([win_type] + _pop_args(win_type,
                                                        arg_map[win_type],
                                                        kwargs))
                return win_type

            def _pop_args(win_type, arg_names, kwargs):
                msg = '%s window requires %%s' % win_type
                all_args = []
                for n in arg_names:
                    if n not in kwargs:
                        raise ValueError(msg % n)
                    all_args.append(kwargs.pop(n))
                return all_args

            win_type = _validate_win_type(self.win_type, kwargs)
            # GH #15662. `False` makes symmetric window, rather than periodic.
            #----Only Line I changed to get a single sided window----
            return sig.get_window(win_type, (window-1)*2+1, False).astype(float)[0:window]

def rolling_new(obj, win_type=None, **kwds): 
    if not isinstance(obj, (ABCSeries, ABCDataFrame)):
        raise TypeError('invalid type: %s' % type(obj))

    if win_type is not None:

        # ---Updated to use the new single_sided class when appropriate
        if win_type.endswith('_single_sided'):
            # Raw string avoids the invalid '\_' escape; strips the suffix before lookup.
            return Window_single_sided(obj, win_type=re.sub(r'_single_sided$', '', win_type), **kwds)
        #----Had to add the rwindow prefixes here...
        return rwindow.Window(obj, win_type=win_type, **kwds)

    return rwindow.Rolling(obj, **kwds)

# Here we set this new method instead of the existing one. 
rwindow.rolling = rolling_new 
# end%% 

# %% Here we test it out 
df = pd.DataFrame([0,1,2,3,4,5,6,7,8]) 

df['triang'] = df[0].rolling(5,win_type='triang').sum() 
df['triang_single_sided'] = df[0].rolling(5,win_type='triang_single_sided').sum() 
df['boxcar'] = df[0].rolling(5,win_type='boxcar').sum() 
ax = df.plot(x=0,y=['triang','triang_single_sided','boxcar']) 
ax.set_ylabel('Sum with different Methods') 
# end%% 

# %% Here we test it out 
from scipy.stats import norm 
t = np.linspace(0,2*np.pi*2,5000) 
y = np.sin(t)*10 + norm.rvs(size=5000) 

df = pd.DataFrame({'t':t,'y':y}) 
df 
df['triang'] = df['y'].rolling(50,win_type='triang').mean() 
df['triang_single_sided'] = df['y'].rolling(50,win_type='triang_single_sided').mean() 
df['boxcar'] = df['y'].rolling(50,win_type='boxcar').mean() 
ax = df.plot(x='t',y=['y','triang','triang_single_sided','boxcar']) 
ax.set_ylabel('Mean with different Methods') 
plt.show() 
# end%%
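Because the code above overrides private pandas internals, it is tied to the pandas version it was written against. As an alternative, here is a minimal sketch that builds the same one-sided weights with scipy.signal.get_window and applies them without touching pandas itself. The helper names one_sided_weights and one_sided_rolling are hypothetical, the input is assumed to contain no NaN values, and "mean" is assumed to mean the weighted sum divided by the sum of the weights.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view
from scipy.signal import get_window

def one_sided_weights(win_type, window):
    # fftbins=False requests a symmetric window; keep only its rising half.
    return get_window(win_type, 2 * window - 1, fftbins=False)[:window]

def one_sided_rolling(series, window, win_type, how="sum"):
    """Hypothetical helper: one-sided weighted rolling sum or mean of a Series."""
    w = one_sided_weights(win_type, window)
    x = series.to_numpy(dtype=float)
    result = np.full(len(x), np.nan)  # incomplete windows stay NaN
    if len(x) >= window:
        # Each row of the view is one complete window ending at the current index.
        vals = sliding_window_view(x, window) @ w
        if how == "mean":
            vals = vals / w.sum()  # assumption: normalise by the weight sum
        result[window - 1:] = vals
    return pd.Series(result, index=series.index)

s = pd.Series(np.arange(9, dtype=float))
rolled_sum = one_sided_rolling(s, 5, "triang")
rolled_mean = one_sided_rolling(s, 5, "triang", how="mean")
print(one_sided_weights("triang", 5))
print(rolled_sum.to_numpy())
```

For the 0..8 test series with a window of 5, the triangular weights rise linearly to 1.0 at the current row, so the most recent value always contributes the most.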