2016-06-08 58 views
1

How can I group data based on value ranges? I have data like this:

[312.281, 
370.401, 
254.245, 
272.256, 
312.325, 
286.243, 
271.231, ...] 

and I want to group the values by range, like this:

for i in data:
    if i in range(200,300):
        data_200_300.append(i)
    elif i in range(300,400):
        data_300_400.append(i)

It doesn't work. What code should I use?

Answers

0

There is a faster option than a chain of if-conditions or lambda filters. It uses logical indexing:

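import numpy as np

def indexingversion(data, bin_start, bin_end, bin_step):
    x = np.array(data)
    # Bin edges, e.g. [200, 300, 400] for start=200, end=400, step=100
    bin_edges = np.arange(bin_start, bin_end + bin_step, bin_step)
    bin_number = bin_edges.size - 1
    # One boolean column per bin: True where the value lies strictly inside it
    cond = np.zeros((x.size, bin_number), dtype=bool)
    for i in range(bin_number):
        cond[:, i] = np.logical_and(bin_edges[i] < x,
                                    x < bin_edges[i+1])
    # Convert back to plain lists so the output matches the other versions
    return [list(x[cond[:, i]]) for i in range(bin_number)]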
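
To see what logical indexing means on its own, here is a minimal sketch on the sample values from the question. The names `mask` and `selected` are only for illustration. A comparison on a numpy array yields a boolean array, and indexing with that array keeps the elements where it is True.

```python
import numpy as np

# Sample values taken from the question
x = np.array([312.281, 370.401, 254.245, 272.256, 312.325, 286.243, 271.231])

# Element-wise comparisons produce boolean arrays; & combines them element-wise
mask = (200 < x) & (x < 300)

# Indexing with the boolean array keeps only the entries where mask is True
selected = x[mask]

print(mask)
print(selected)
```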

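I put all the solutions posted so far, plus my own, into function versions, ran each of them once, and profiled them with the line profiler (rkern/line_profiler). The last line checks that all three outputs are identical (which adds a little overhead to my version, because I have to convert the data to a numpy array at the start and back to lists at the end).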

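My version and the lambda version have one more advantage: you can use them to group into other bins, whereas with the first solution you would have to rewrite the if-statements.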

import numpy as np 

def forloop(x):
    data_200_300 = []
    data_300_400 = []
    for i in x:
        if 200 < i < 300:
            data_200_300.append(i)
        elif 300 < i < 400:
            data_300_400.append(i)
    return [data_200_300, data_300_400]


def lambdaversion(data, bin_start, bin_end, bin_step):
    filtered_data = []
    for i in range(bin_start,bin_end,bin_step):
        # list() makes the result a list on Python 3 too, where filter() is lazy
        filtered_data.append(list(filter(lambda x: i < x < i+bin_step, data)))
    return filtered_data


def indexingversion(data, bin_start, bin_end, bin_step):
    x = np.array(data)
    bin_edges = np.arange(bin_start, bin_end + bin_step, bin_step)
    bin_number = bin_edges.size - 1
    cond = np.zeros((x.size, bin_number), dtype=bool)
    for i in range(bin_number):
        cond[:, i] = np.logical_and(bin_edges[i] < x,
                                    x < bin_edges[i+1])
    return [list(x[cond[:, i]]) for i in range(bin_number)]


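# Uncomment the decorator when running under kernprof (line_profiler)
#@profile
def run_all():
    n = 100000
    # random_integers is deprecated since NumPy 1.11;
    # np.random.randint(200, 401, n) is the equivalent call
    x = np.random.random_integers(200, 400, n) + np.random.ranf(n)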
    bin_start = 200 
    bin_end = 400 
    bin_step = 100 
    a = forloop(x) 
    b = lambdaversion(x, bin_start, bin_end, bin_step) 
    c = indexingversion(x, bin_start, bin_end, bin_step) 
    print('All the same? - ' + str(a == b == c)) 


if __name__ == '__main__': 
    run_all() 

Profiling output:

All the same? - True 
Wrote profile results to bla.py.lprof 
Timer unit: 1e-06 s 

Total time: 0.580098 s 
File: bla.py 
Function: run_all at line 32 

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    32                                           @profile
    33                                           def run_all():
    34         1            1      1.0      0.0      n = 100000
    35         1         3311   3311.0      0.6      x = np.random.random_integers(200, 400, n) + np.random.ranf(n)
    36         1            2      2.0      0.0      bin_start = 200
    37         1            1      1.0      0.0      bin_end = 400
    38         1            0      0.0      0.0      bin_step = 100
    39         1       263073 263073.0     45.3      a = forloop(x)
    40         1       301819 301819.0     52.0      b = lambdaversion(x, bin_start, bin_end, bin_step)
    41         1         7514   7514.0      1.3      c = indexingversion(x, bin_start, bin_end, bin_step)
    42         1         4377   4377.0      0.8      print('All the same? - ' + str(a == b == c))

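As you can see (in the Time and % Time columns), numpy indexing is faster by a factor of roughly 35 to 40 in this run, at least for 100,000 numbers. For very small inputs, however, it is slower (on my machine it starts to pay off at about 40 values).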
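
The point above about regrouping into other bins can also be sketched without numpy. This is not one of the profiled versions: it is an illustrative alternative built on the standard-library `bisect` module, and `group_by_edges` is a made-up name. Note that it assumes half-open bins (the lower edge is included), unlike the strict comparisons used in the functions above.

```python
from bisect import bisect_right

def group_by_edges(data, edges):
    """Group values into half-open bins [edges[i], edges[i+1]).

    `edges` must be sorted in ascending order. Values below the first
    edge or at/above the last edge are dropped.
    """
    bins = [[] for _ in range(len(edges) - 1)]
    for value in data:
        # Index of the last edge that is <= value
        i = bisect_right(edges, value) - 1
        if 0 <= i < len(bins):
            bins[i].append(value)
    return bins

# Sample values taken from the question
data = [312.281, 370.401, 254.245, 272.256, 312.325, 286.243, 271.231]

# Same two bins as in the question
coarse = group_by_edges(data, [200, 300, 400])

# Regrouping only means passing different edges; no if-statements to rewrite
fine = group_by_edges(data, [250, 275, 300, 325, 350, 375])

print(coarse)
print(fine)
```

Each value costs one binary search over the edges, so the number of bins has little effect on the running time. I have not profiled it against the versions above.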

3

range yields only the integers between the two numbers, whereas your data contains floats. You can use comparisons with > and < directly:

for i in data:
    if 200 < i < 300:
        data_200_300.append(i)
    elif 300 < i < 400:
        data_300_400.append(i)

If you want some of the bounds to be inclusive, you can use <= as well.
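
A small sketch of the difference, using the sample values from the question plus one hypothetical boundary value (300.0, added here only to show what happens at an edge):

```python
# Sample values from the question; 300.0 is a hypothetical extra value
data = [312.281, 370.401, 254.245, 272.256, 312.325, 286.243, 271.231, 300.0]

# Membership in range() only matches values equal to one of its integers
in_range = [v for v in data if v in range(300, 400)]

# Strict chained comparison: both bounds excluded
strict = [v for v in data if 300 < v < 400]

# Inclusive lower bound via <=
half_open = [v for v in data if 300 <= v < 400]

print(in_range)
print(strict)
print(half_open)
```

Only 300.0 passes the range() test, because it compares equal to the integer 300. The strict version drops it, and the <= version keeps it.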

+0

What if I want to group by a column, as in DF = [ID,V1,V2,V3 1,12,32,23 2,65,45,22 3,55,34,76 ...]? If I want to group based on the V3 column, what should I do?

0

@AKS answered the question correctly. You can also try a lambda expression:

result = list(filter(lambda x: 200 < x < 300, data))  # list() is needed on Python 3, where filter() is lazy

If you have many such ranges, you can also process your data in a loop like this, without needing numpy:

filtered_data = []
for i in range(200,400,100):
    # list() evaluates the filter immediately, while i still has this value
    filtered_data.append(list(filter(lambda x: i < x < i+100, data)))

>>> filtered_data 
[[254.245, 272.256, 286.243, 271.231], [312.281, 370.401, 312.325]]
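
One caveat if you run this on Python 3: there `filter` returns a lazy iterator, and the lambda looks up `i` only when the iterator is consumed. Collecting unevaluated filter objects in a loop can therefore give every bin the last value of `i`. The sketch below shows the pitfall and an eager list-comprehension variant. The function names `lazy_filter_groups` and `group_by_step` are made up for illustration.

```python
def lazy_filter_groups(values, start, end, step):
    """Pitfall on Python 3: the filters are consumed after the loop has finished."""
    lazy = []
    for i in range(start, end, step):
        lazy.append(filter(lambda x: i < x < i + step, values))
    # By now i holds its final value, so every filter uses the same bounds
    return [list(f) for f in lazy]

def group_by_step(values, start, end, step):
    """Eager variant: each inner list is built while `lo` has the right value."""
    return [[v for v in values if lo < v < lo + step]
            for lo in range(start, end, step)]

# Sample values taken from the question
data = [312.281, 370.401, 254.245, 272.256, 312.325, 286.243, 271.231]

wrong = lazy_filter_groups(data, 200, 400, 100)
grouped = group_by_step(data, 200, 400, 100)

print(wrong)
print(grouped)
```

On Python 3 both entries of `wrong` contain the 300 to 400 values, while `grouped` matches the output shown above.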