If you have more than one bin, there is a faster option than if-conditions or a lambda filter. It uses logical (boolean) indexing:
    def indexingversion(data, bin_start, bin_end, bin_step):
        x = np.array(data)
        bin_edges = np.arange(bin_start, bin_end + bin_step, bin_step)
        bin_number = bin_edges.size - 1
        cond = np.zeros((x.size, bin_number), dtype=bool)
        for i in range(bin_number):
            cond[:, i] = np.logical_and(bin_edges[i] < x,
                                        x < bin_edges[i+1])
        return [list(x[cond[:, i]]) for i in range(bin_number)]
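To make the masking idea concrete, here is a minimal sketch with made-up sample values (they are not taken from the benchmark). A comparison on an array yields a boolean array of the same shape, and indexing with that array keeps only the positions that are `True`.

```python
import numpy as np

# Sample values chosen purely for illustration.
x = np.array([210.5, 350.0, 299.9, 310.2, 150.0])

# Element-wise comparisons give a boolean mask of the same shape as x.
mask = np.logical_and(200 < x, x < 300)

# Indexing with the mask selects the elements where it is True.
selected = x[mask]
print(mask.tolist())      # [True, False, True, False, False]
print(selected.tolist())  # [210.5, 299.9]
```

Each column of `cond` in the function above is one such mask, one per bin.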
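I put all the solutions so far, plus my own, into functions and ran them all once under the line profiler (rkern/line_profiler). The last line of the script checks that all three outputs are the same (this adds some overhead to my version, because I have to convert the data to a numpy array at the start and back to lists at the end).
My version and the lambda version have another advantage: you can regroup the data into other bins just by changing the arguments, whereas in the first solution you have to rewrite the if-statements.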
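As a rough illustration of that regrouping, the sketch below bins the same data twice with different step sizes. The helper `bin_values` and the sample data are hypothetical, written only for this example; it uses the same open-interval bins as the indexing approach but builds each mask inline.

```python
import numpy as np

def bin_values(data, bin_start, bin_end, bin_step):
    # Hypothetical helper: open intervals (lo, hi), one list per bin.
    x = np.asarray(data)
    edges = np.arange(bin_start, bin_end + bin_step, bin_step)
    return [x[(lo < x) & (x < hi)].tolist()
            for lo, hi in zip(edges[:-1], edges[1:])]

data = [205.0, 260.0, 310.0, 390.0]       # made-up sample values
coarse = bin_values(data, 200, 400, 100)  # 2 bins of width 100
fine = bin_values(data, 200, 400, 50)     # 4 bins of width 50
print(coarse)  # [[205.0, 260.0], [310.0, 390.0]]
print(fine)    # [[205.0], [260.0], [310.0], [390.0]]
```

Note that with strict inequalities on both sides, a value lying exactly on a bin edge ends up in no bin. The same holds for all three versions compared here.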
    import numpy as np

    def forloop(x):
        # Hard-coded bins: regrouping means rewriting these if-statements.
        data_200_300 = []
        data_300_400 = []
        for i in x:
            if 200 < i < 300:
                data_200_300.append(i)
            elif 300 < i < 400:
                data_300_400.append(i)
        return [data_200_300, data_300_400]

    def lambdaversion(data, bin_start, bin_end, bin_step):
        filtered_data = []
        for i in range(bin_start, bin_end, bin_step):
            # list() forces evaluation now; in Python 3 filter() is lazy,
            # so without it the lambda would see a later value of i.
            filtered_data.append(
                list(filter(lambda x: i < x < i + bin_step, data)))
        return filtered_data

    def indexingversion(data, bin_start, bin_end, bin_step):
        x = np.array(data)
        bin_edges = np.arange(bin_start, bin_end + bin_step, bin_step)
        bin_number = bin_edges.size - 1
        # One boolean column per bin; True where the value falls in that bin.
        cond = np.zeros((x.size, bin_number), dtype=bool)
        for i in range(bin_number):
            cond[:, i] = np.logical_and(bin_edges[i] < x,
                                        x < bin_edges[i+1])
        return [list(x[cond[:, i]]) for i in range(bin_number)]

    # @profile  (uncomment when running under kernprof -l)
    def run_all():
        n = 100000
        # The profiled run used the now-deprecated
        # np.random.random_integers(200, 400, n); randint's upper bound
        # is exclusive, hence 401.
        x = np.random.randint(200, 401, n) + np.random.ranf(n)
        bin_start = 200
        bin_end = 400
        bin_step = 100
        a = forloop(x)
        b = lambdaversion(x, bin_start, bin_end, bin_step)
        c = indexingversion(x, bin_start, bin_end, bin_step)
        print('All the same? - ' + str(a == b == c))

    if __name__ == '__main__':
        run_all()
Profiler output:
    All the same? - True
    Wrote profile results to bla.py.lprof
    Timer unit: 1e-06 s

    Total time: 0.580098 s
    File: bla.py
    Function: run_all at line 32

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
        32                                           @profile
        33                                           def run_all():
        34         1            1      1.0      0.0      n = 100000
        35         1         3311   3311.0      0.6      x = np.random.random_integers(200, 400, n) + np.random.ranf(n)
        36         1            2      2.0      0.0      bin_start = 200
        37         1            1      1.0      0.0      bin_end = 400
        38         1            0      0.0      0.0      bin_step = 100
        39         1       263073 263073.0     45.3      a = forloop(x)
        40         1       301819 301819.0     52.0      b = lambdaversion(x, bin_start, bin_end, bin_step)
        41         1         7514   7514.0      1.3      c = indexingversion(x, bin_start, bin_end, bin_step)
        42         1         4377   4377.0      0.8      print('All the same? - ' + str(a == b == c))
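As you can see (in the Time or % Time column), numpy indexing is roughly 35 to 40 times faster (about 7.5 ms versus 263 ms and 302 ms), at least for 100,000 numbers. For very small inputs, however, it is slower (on my machine it starts being faster at about 40 values).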
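The break-even point depends on the machine, so it may be worth measuring it yourself. The following is only a sketch: the helper names `loop_bins` and `mask_bins`, the input sizes, and the repeat count are all assumptions made for this example, and the bins are hard-coded to the two used above.

```python
import timeit
import numpy as np

def loop_bins(x):
    # Plain-Python version with the two hard-coded bins.
    low, high = [], []
    for v in x:
        if 200 < v < 300:
            low.append(v)
        elif 300 < v < 400:
            high.append(v)
    return [low, high]

def mask_bins(x):
    # NumPy version using boolean masks for the same two bins.
    x = np.asarray(x)
    return [x[(200 < x) & (x < 300)].tolist(),
            x[(300 < x) & (x < 400)].tolist()]

rng = np.random.default_rng(0)
for n in (10, 40, 1000):
    data = rng.uniform(200, 400, n).tolist()
    t_loop = timeit.timeit(lambda: loop_bins(data), number=200)
    t_mask = timeit.timeit(lambda: mask_bins(data), number=200)
    print(f"n={n}: loop {t_loop:.4f} s, mask {t_mask:.4f} s")
```

The timings include the list-to-array conversion in `mask_bins`, which is a large part of the cost for tiny inputs.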
What if my data has columns, like DF = [ID, V1, V2, V3; 1, 12, 32, 23; 2, 65, 45, 22; 3, 55, 34, 76; ...], and I want to group based on the V3 column? What should I do?