2017-09-15 98 views
2

Weighted smoothing of a series in numpy/pandas

I have a pandas DataFrame of measurements and corresponding weights:

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.randn(1000), 'w': np.random.rand(1000)})

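I want to smooth the measured values (x) while also taking the element-wise weights (w) into account. These are independent of the weights of the sliding window, which I also want to apply (e.g. a triangular window, or something fancier). So, to compute the smoothed value within each window, the function should weight the sliced elements of x not only by the window function (e.g. triangle) but also by the corresponding elements of w.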
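To make the requirement concrete, here is a minimal sketch of the calculation for a single window. The three-element arrays are made-up values chosen for illustration, not data from the question.

```python
import numpy as np

# Hypothetical values for one 3-element window.
x_win = np.array([1.0, 2.0, 4.0])      # measurements inside the window
w_win = np.array([1.0, 0.5, 0.25])     # per-element weights for those measurements
triangle = np.array([0.5, 1.0, 0.5])   # window-shape weights

# Both sets of weights multiply together before averaging.
combined = triangle * w_win
smoothed = np.sum(combined * x_win) / np.sum(combined)
print(smoothed)
```

This is the same quantity that `np.average(x_win, weights=triangle * w_win)` returns.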

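As far as I can tell, pd.rolling_apply won't do this, because it applies the function to x and w separately. Likewise, pd.rolling_window does not take the element-wise weights from the source DataFrame into account; the weighted window (e.g. 'triangle') can be user-defined, but it is fixed up front.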

Here is my slow-ish implementation:

def rolling_weighted_triangle(x, w, window_size): 
    """Smooth with triangle window, also using per-element weights.""" 
    # Simplify slicing 
    wing = window_size // 2 

    # Pad both arrays with mirror-image values at edges 
    xp = np.r_[x[wing-1::-1], x, x[:-wing-1:-1]] 
    wp = np.r_[w[wing-1::-1], w, w[:-wing-1:-1]] 

    # Generate a (triangular) window of weights to slide 
    incr = 1./(wing + 1) 
    ramp = np.arange(incr, 1, incr) 
    triangle = np.r_[ramp, 1.0, ramp[::-1]] 

    # Apply both sets of weights over each window 
    slices = (slice(i - wing, i + wing + 1) for i in range(wing, len(x) + wing))
    out = (np.average(xp[slc], weights=triangle * wp[slc]) for slc in slices) 
    return np.fromiter(out, x.dtype) 

How can I speed this up with numpy/scipy/pandas?

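The DataFrame can already occupy a substantial part of RAM (10k to 200M rows), so, for example, pre-allocating a 2D array of window weights for each element is too much. I'm trying to minimize the use of temporary arrays, perhaps using np.lib.stride_tricks.as_strided, np.apply_along_axis or np.convolve, but I haven't found anything that fully replicates the above.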
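As a side note on temporary arrays, the sketch below shows that a strided window is only a view onto the original buffer. It uses `sliding_window_view`, which assumes NumPy 1.20 or later (newer than this post); the toy array is made up for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # assumes NumPy >= 1.20

x = np.arange(10, dtype=float)

# Read-only view of shape (len(x) - width + 1, width); no data is copied.
windows = sliding_window_view(x, 5)
print(windows.shape)
print(np.shares_memory(windows, x))
```

The view itself is cheap, but multiplying it by a weight array materializes a full 2D temporary, which is the memory cost the question is trying to avoid.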

Here is the equivalent with a uniform window instead of a triangle (using the get_sliding_window trick from here) - close, but not quite there:

from numpy.lib.stride_tricks import as_strided


def get_sliding_window(a, width):
    """Sliding window over a 2D array. 

    Source: https://stackoverflow.com/questions/37447347/dataframe-representation-of-a-rolling-window/41406783#41406783 
    """ 
    # NB: a = df.values or np.vstack([x, y]).T 
    s0, s1 = a.strides 
    m, n = a.shape 
    return as_strided(a, 
        shape=(m-width+1, width, n), 
        strides=(s0, s0, s1)) 


def rolling_weighted_average(x, w, window_size): 
    """Rolling weighted average with a uniform 'boxcar' window.""" 
    wing = window_size // 2 
    window_size = 2 * wing + 1 
    xp = np.r_[x[wing-1::-1], x, x[:-wing-1:-1]] 
    wp = np.r_[w[wing-1::-1], w, w[:-wing-1:-1]] 
    x_w = np.vstack([xp, wp]).T 
    wins = get_sliding_window(x_w, window_size) 
    # TODO - apply triangle window weights - multiply over wins[,:,1]? 
    result = np.average(wins[:,:,0], axis=1, weights=wins[:,:,1]) 
    return result 
+0

Isn't this equivalent to applying the window on 'w * x'? Maybe you could generate that column first? – VBB

+0

It doesn't appear to be. The mean within a given window slice is not necessarily 0. –
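To illustrate the exchange above, here is a sketch on made-up random data comparing two things: windowing the product `w * x` alone, and additionally dividing by the windowed weights. The variable names and the 7-point triangle are assumptions for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(50)
w = rng.random(50)
triangle = np.array([0.25, 0.5, 0.75, 1.0, 0.75, 0.5, 0.25])
n_win = triangle.size

# Windowing only the product w * x, normalized by the window's own sum.
product_only = np.convolve(w * x, triangle, mode="valid") / triangle.sum()

# Weighted average: the windowed product divided by the windowed weights.
weighted = (np.convolve(w * x, triangle, mode="valid")
            / np.convolve(w, triangle, mode="valid"))

# Direct per-window reference using np.average.
reference = np.array([
    np.average(x[i:i + n_win], weights=triangle * w[i:i + n_win])
    for i in range(len(x) - n_win + 1)
])

print(np.allclose(weighted, reference))
print(np.allclose(product_only, reference))
```

Only the version normalized by the windowed weights matches the per-window `np.average` result, because the sum of `triangle * w` changes from window to window.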

Answers

1

You can simply use convolution there, like so -

def rolling_weighted_triangle_conv(x, w, window_size): 
    """Smooth with triangle window, also using per-element weights.""" 
    # Simplify slicing 
    wing = window_size // 2 

    # Pad both arrays with mirror-image values at edges 
    xp = np.concatenate((x[wing-1::-1], x, x[:-wing-1:-1])) 
    wp = np.concatenate((w[wing-1::-1], w, w[:-wing-1:-1])) 

    # Generate a (triangular) window of weights to slide 
    incr = 1./(wing + 1) 
    ramp = np.arange(incr, 1, incr) 
    triangle = np.r_[ramp, 1.0, ramp[::-1]] 

    # mode='valid' keeps only fully overlapping positions, so the output
    # has len(x) elements even when window_size is even
    D = np.convolve(wp*xp, triangle, mode='valid')
    N = np.convolve(wp, triangle, mode='valid')
    return D/N 
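The reason two convolutions suffice can be written out. In the notation below (symbols introduced here for illustration, not taken from the post), $t_k$ is the triangle window, $h$ is `wing`, and $x_i$, $w_i$ index the padded arrays `xp` and `wp`:

```latex
y_i = \frac{\sum_{k=-h}^{h} t_k \, w_{i+k} \, x_{i+k}}{\sum_{k=-h}^{h} t_k \, w_{i+k}}
```

The numerator is the window slid over the product `wp*xp` (`D` in the code) and the denominator is the same window slid over `wp` (`N` in the code). Because the triangle is symmetric, sliding it over an array is the same as convolving with it.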

Runtime test

In [265]: x = np.random.randn(1000) 
    ...: w = np.random.rand(1000) 
    ...: WSZ = 7 
    ...: 

In [266]: out1 = rolling_weighted_triangle(x, w, window_size=WSZ) 
    ...: out2 = rolling_weighted_triangle_conv(x, w, window_size=WSZ) 
    ...: print(np.allclose(out1, out2)) 
    ...: 
True 

In [267]: %timeit rolling_weighted_triangle(x, w, window_size=WSZ) 
    ...: %timeit rolling_weighted_triangle_conv(x, w, window_size=WSZ) 
    ...: 
100 loops, best of 3: 10.2 ms per loop 
10000 loops, best of 3: 32.9 µs per loop 

300x+ speedup!

+0

Awesome. This approach also makes it easy to plug in another window shape, such as Kaiser, in place of the triangle. –

+0

@EricTalevich Yes! Any kind of weighting function could be plugged in there. – Divakar
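A minimal sketch of that idea follows, assuming an odd-length window passed in as an array. The function name `rolling_weighted_window` and the Kaiser `beta` of 5.0 are hypothetical choices for illustration, not part of the answer. It uses `np.pad(..., mode="symmetric")` for the mirror padding and `mode="valid"` so that the output has the same length as the input.

```python
import numpy as np

def rolling_weighted_window(x, w, window):
    """Smooth x with an arbitrary odd-length window and per-element weights w."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    window = np.asarray(window, dtype=float)
    if window.size % 2 == 0:
        raise ValueError("window length must be odd")
    wing = window.size // 2

    # Mirror-image padding at both edges (edge value included).
    xp = np.pad(x, wing, mode="symmetric")
    wp = np.pad(w, wing, mode="symmetric")

    # np.convolve flips its second argument; flipping it back means that
    # asymmetric windows are also applied in the order given.
    kernel = window[::-1]
    num = np.convolve(wp * xp, kernel, mode="valid")
    den = np.convolve(wp, kernel, mode="valid")
    return num / den

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
w = rng.random(1000)
smoothed = rolling_weighted_window(x, w, np.kaiser(7, 5.0))
print(smoothed.shape)
```

Any other array of window weights, for example `np.hanning(9)[1:-1]` to drop the zero endpoints, could be passed in the same way, provided the windowed weights never sum to zero.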