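Finding a large number of consecutive values fulfilling a condition in a numpy array

I have some audio data loaded in a numpy array and I wish to segment the data by finding silent parts, i.e. stretches of time where the audio amplitude stays below a certain threshold.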

An extremely simple way to do this is something like this:

import re

# Encode each sample as "1" (silent) or "0" (loud), then search for long runs of "1"
values = ''.join("1" if abs(x) < SILENCE_THRESHOLD else "0" for x in samples)
pattern = re.compile('1{%d,}' % int(MIN_SILENCE))
for match in pattern.finditer(values):
    pass  # code goes here

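The code above finds sections with at least MIN_SILENCE consecutive elements smaller than SILENCE_THRESHOLD.

Now, obviously, the code above is horribly inefficient and a terrible abuse of regular expressions. Is there some other method that is more efficient, but still results in equally simple and short code?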

2010-12-20 pafcu

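Here's a numpy-based solution.

I think (?) it should be faster than the other options. Hopefully it's fairly clear.

However, it does require twice as much memory as the various generator-based solutions. As long as you can hold a single temporary copy of your data in memory (for the diff), along with a boolean array of the same length as your data (NumPy stores each boolean as one byte), it should be pretty efficient...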

import numpy as np

def main():
    # Generate some random data
    x = np.cumsum(np.random.random(1000) - 0.5)
    condition = np.abs(x) < 1

    # Print the start and stop indices of each region where the absolute
    # values of x are below 1, and the min and max of each of these regions
    for start, stop in contiguous_regions(condition):
        segment = x[start:stop]
        print(start, stop)
        print(segment.min(), segment.max())

def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index (exclusive)."""

    # Find the indices of changes in "condition"
    d = np.diff(condition)
    idx, = d.nonzero()

    # We need to start things after the change in "condition". Therefore,
    # we'll shift the index by 1 to the right.
    idx += 1

    if condition[0]:
        # If the start of condition is True prepend a 0
        idx = np.r_[0, idx]

    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size]

    # Reshape the result into two columns
    idx.shape = (-1, 2)
    return idx

main()

2010-12-21 00:11:32

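This results in an impressive 20x speedup! It doesn't take the minimum length into account, but that is easy to add. The only problem is the increased memory usage, which makes it infeasible in some cases, so I think I'll use this by default and add an option to fall back to another algorithm when memory is low. – pafcu 2010-12-21 05:46:56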
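The minimum-length filter mentioned in the comment above could look roughly like the sketch below. It is not the answerer's code: `contiguous_regions` is re-implemented here in a compact padded form so the snippet runs on its own, and `long_regions`, the threshold of 1.0 and the minimum length of 3 are hypothetical names and values chosen for illustration.

```python
import numpy as np

def contiguous_regions(condition):
    """Return an (n, 2) array of [start, stop) index pairs, one per True run."""
    condition = np.asarray(condition, dtype=bool)
    # Pad with False on both sides so every run has a rising and a falling edge
    padded = np.concatenate(([False], condition, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return edges.reshape(-1, 2)

def long_regions(condition, min_length):
    """Keep only the True runs that span at least min_length samples."""
    regions = contiguous_regions(condition)
    lengths = regions[:, 1] - regions[:, 0]
    return regions[lengths >= min_length]

samples = np.array([0.0, 0.1, 5.0, 0.2, 0.1, 0.0, 0.3, 7.0, 0.1])
silent = np.abs(samples) < 1.0               # stand-in for SILENCE_THRESHOLD
result = long_regions(silent, min_length=3)  # stand-in for MIN_SILENCE
print(result)
```

Filtering on the `(start, stop)` table keeps the extra work proportional to the number of regions rather than the number of samples.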

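With numpy 1.9, I get a 'DeprecationWarning: numpy boolean subtract (the binary `-` operator) is deprecated' when using np.diff on the boolean condition. I replaced that line with 'd = np.subtract(condition[1:], condition[:-1], dtype=np.float)' to avoid the problem. – daryl 2014-09-29 15:30:43

@daryl - Thanks for noticing the change! It might be clearer to do 'd = np.diff(condition.astype(int))', though that's mostly a matter of personal preference. – 2014-09-29 19:10:38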
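As a small illustration of the integer-cast variant discussed in these comments (the `condition` values below are made up for the example), the difference of the cast array is +1 where a True run begins and -1 where one ends:

```python
import numpy as np

condition = np.array([False, True, True, False, False, True])

# After casting to int, each step is +1 (False -> True), -1 (True -> False) or 0
d = np.diff(condition.astype(int))

# diff shifts positions left by one, hence the + 1
starts = np.flatnonzero(d == 1) + 1
stops = np.flatnonzero(d == -1) + 1
print(d, starts, stops)
```

Note that the last run here reaches the end of the array and therefore has no matching stop, which is the corner case the answer handles by appending `condition.size`.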

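I haven't tested this, but it should be close to what you're looking for. It's slightly more lines of code, but it should be more efficient and more readable, and it doesn't abuse regular expressions :-)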

def find_silent(samples):
    """Yield each run of at least MIN_SILENCE samples below SILENCE_THRESHOLD."""
    num_silent = 0
    start = 0
    for index, sample in enumerate(samples):
        if abs(sample) < SILENCE_THRESHOLD:
            if num_silent == 0:
                start = index
            num_silent += 1
        else:
            if num_silent >= MIN_SILENCE:
                yield samples[start:index]
            num_silent = 0
    # A silent run that reaches the end of the data is never closed by a loud sample
    if num_silent >= MIN_SILENCE:
        yield samples[start:]

for match in find_silent(samples):
    pass  # code goes here

2010-12-20 22:37:13

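Your code looks good, except that a silent segment at the very end of the samples won't be found. You need to check for it after the for loop. – 2010-12-20 22:48:24

@Justin: Thanks, added in an edit. – 2010-12-20 23:45:45

This should return a list of (start, length) pairs: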

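def silent_segs(samples, threshold, min_dur):
    start = -1
    silent_segments = []
    for idx, x in enumerate(samples):
        if start < 0 and abs(x) < threshold:
            start = idx
        elif start >= 0 and abs(x) >= threshold:
            dur = idx - start
            if dur >= min_dur:
                silent_segments.append((start, dur))
            start = -1
    # Close a silent segment that runs to the end of the samples
    if start >= 0 and len(samples) - start >= min_dur:
        silent_segments.append((start, len(samples) - start))
    return silent_segments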

And a simple test:

>>> s = [-1,0,0,0,-1,10,-10,1,2,1,0,0,0,-1,-10] 
>>> silent_segs(s,2,2) 
[(0, 5), (9, 5)]

2010-12-20 22:41:07 job

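This seems to be about 25% faster than the regex-based solution. Nice. Now it only takes 9 minutes :-) – pafcu 2010-12-20 23:23:46

A little sloppy, but simple and fast-ish, if you don't mind using scipy: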

from scipy.ndimage import gaussian_filter 
sigma = 3 
threshold = 1 
above_threshold = gaussian_filter(data, sigma=sigma) > threshold

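The idea is that quiet portions of the data will smooth down to a low amplitude, while loud regions won't. Tune 'sigma' to affect how long a 'quiet' region must last; tune 'threshold' to affect how quiet it must be. This slows down for large sigma, at which point FFT-based smoothing might be faster.

This has the added benefit that a single 'hot pixel' won't disrupt your silence-finding, so you're a little less sensitive to certain types of noise.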
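A rough way to see the 'hot pixel' effect on synthetic data is sketched below. The signal, the seed and the spike are invented for the example, and smoothing `np.abs(data)` rather than the raw samples is an assumption made here (so that zero-mean loud audio does not average out); the answer itself filters `data` directly.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
data = np.concatenate([
    rng.normal(0, 5.0, 200),    # loud
    rng.normal(0, 0.05, 200),   # quiet
    rng.normal(0, 5.0, 200),    # loud
])
data[300] = 5.0                 # a single spike inside the quiet section

sigma = 3
threshold = 1

# Assumption: smooth the magnitude, not the signed signal
envelope = gaussian_filter(np.abs(data), sigma=sigma)
quiet = envelope < threshold

naive_quiet = np.abs(data) < threshold   # per-sample test, no smoothing
print(naive_quiet[300], quiet[300])
```

The per-sample test breaks the quiet section at the spike, while the smoothed envelope stays below the threshold there, because the spike's energy is spread over neighbouring samples.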

2010-12-20 23:53:43 Andrew

Another way to do this quickly and concisely:

import pylab as pl 

v=[0,0,1,1,0,0,1,1,1,1,1,0,1,0,1,1,0,0,0,0,0,1,0,0] 
vd = pl.diff(v) 
#vd[i]==1 for 0->1 crossing; vd[i]==-1 for 1->0 crossing 
#need to add +1 to indexes as pl.diff shifts to left by 1 

i1=pl.array([i for i in range(len(vd)) if vd[i]==1])+1 
i2=pl.array([i for i in range(len(vd)) if vd[i]==-1])+1 

#corner cases for the first and the last element 
if v[0]==1: 
    i1=pl.hstack((0,i1)) 
if v[-1]==1: 
    i2=pl.hstack((i2,len(v)))

Now i1 contains the start indices and i2 the (exclusive) end indices of the 1,...,1 regions.
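The same idea can be written with plain numpy and without the Python-level list comprehensions. This is a sketch of an equivalent formulation, not the answerer's code; `np.flatnonzero` is swapped in for the index loops.

```python
import numpy as np

v = np.array([0,0,1,1,0,0,1,1,1,1,1,0,1,0,1,1,0,0,0,0,0,1,0,0])
vd = np.diff(v)

# +1 because diff shifts every crossing one position to the left
i1 = np.flatnonzero(vd == 1) + 1
i2 = np.flatnonzero(vd == -1) + 1

# Corner cases for the first and the last element
if v[0] == 1:
    i1 = np.hstack((0, i1))
if v[-1] == 1:
    i2 = np.hstack((i2, len(v)))

pairs = list(zip(i1.tolist(), i2.tolist()))
print(pairs)  # [(2, 4), (6, 11), (12, 13), (14, 16), (21, 22)]
```

Each pair can be used directly as a slice, e.g. `v[2:4]` is the first run of ones.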

2011-02-23 10:51:05 Brano

There is a very convenient solution to this using scipy.ndimage. For an array:

a = np.array([1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0])

which can be the result of a condition applied to another array, finding the contiguous regions is as simple as:

regions = scipy.ndimage.find_objects(scipy.ndimage.label(a)[0])

Then, applying any function to those regions can be done e.g. like this:

[np.sum(a[r]) for r in regions]
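Put together as a runnable snippet, with the imports the fragments above leave implicit, this might look as follows. The `bounds` list is an addition for illustration: it unpacks the one-element slice tuples returned by `find_objects` into plain `(start, stop)` pairs.

```python
import numpy as np
from scipy import ndimage

a = np.array([1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0])

# label() numbers each connected run of non-zero values: 1, 2, ...
labeled, num_regions = ndimage.label(a)

# find_objects() returns one tuple of slices per label (one slice in 1-D)
regions = ndimage.find_objects(labeled)

bounds = [(r[0].start, r[0].stop) for r in regions]
sums = [int(np.sum(a[r])) for r in regions]
print(bounds, sums)  # [(0, 4), (7, 10)] [4, 3]
```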

2015-06-15 19:00:52

@Joe Kington, I got a speed improvement of about 20%-25% over the np.diff / np.nonzero solution by using argmax instead (see the code below; condition is a boolean array)

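def contiguous_regions(condition):
    idx = []
    i = 0
    while i < len(condition):
        # First True at or after i (argmax returns 0 if there is none)
        x1 = i + condition[i:].argmax()
        try:
            # First False at or after x1 (argmin returns 0 if there is none)
            x2 = x1 + condition[x1:].argmin()
        except ValueError:
            x2 = x1 + 1
        if x1 == x2:
            if condition[x1]:
                # The run of True values extends to the end of the array
                x2 = len(condition)
            else:
                # No True values remain
                break
        idx.append([x1, x2])
        i = x2
    return idx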

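Of course, your mileage may vary depending on your data.

Besides, I'm not entirely sure, but I guess numpy may optimize argmin/argmax over boolean arrays to stop searching at the first True/False occurrence. That might explain it.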
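The behaviour this loop relies on can be checked on a tiny array. The sketch below only shows what `argmax` and `argmin` return on boolean data (the first occurrence of the extreme value, and 0 when that value is the only one present); it does not measure speed or confirm any short-circuiting inside numpy.

```python
import numpy as np

condition = np.array([False, False, True, True, False])

# argmax on booleans gives the index of the first True
first_true = int(condition.argmax())                                   # 2

# argmin on the remainder gives the offset of the next False
first_false_after = first_true + int(condition[first_true:].argmin())  # 4

# With no True at all, argmax falls back to index 0, which is why the
# loop has to re-check condition[x1] when x1 == x2
no_true = np.zeros(4, dtype=bool)
fallback = int(no_true.argmax())                                       # 0

print(first_true, first_false_after, fallback)
```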

2015-07-01 17:09:44 user2154321
