Fastest way to compute the distance between two points in parallel with Python (2016-06-13)

I have a dataframe 'data' with millions of rows. Each row holds a coordinate pair ('x', 'y'), and I want to compute the distance between each pair of consecutive coordinates in the most efficient way Python offers. Would parallelization help here?

I have seen approaches suggested here that use Cython. However, I would like to see a pure-Python solution.

Here is a snippet of my data:

points = 
[(26406, -6869), 
(27679, -221), 
(27679, -221), 
(26416, -6156), 
(26679, -578), 
(26679, -580), 
(27813, -558), 
(26254, -1097), 
(26679, -580), 
(27813, -558), 
(28258, -893), 
(26253, -1098), 
(26678, -581), 
(27811, -558), 
(28259, -893), 
(26252, -1098), 
(27230, -481), 
(26679, -582), 
(27488, -5849), 
(27811, -558), 
(28259, -893), 
(26250, -1099), 
(27228, -481), 
(26679, -582), 
(27488, -5847), 
(28525, -1465), 
(27811, -558), 
(28259, -892)] 

I believe my first approach, which uses a plain for loop, can be improved significantly:

from scipy.spatial import distance

def comp_dist(points):
    size = len(points)
    d = 0
    for i in range(1, size):
        if i % 1000000 == 0:
            print(i)  # progress indicator for very long inputs
        d += distance.euclidean(points[i - 1], points[i])
    return d

total_dist = comp_dist(points)
print(total_dist)
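For reference, the same accumulation can be written with only the standard library; this is a minimal sketch using math.hypot (comp_dist_hypot is an illustrative name, not something from the original post), which avoids the per-call overhead of scipy.spatial.distance.euclidean:

import math

def comp_dist_hypot(points):
    # Sum the Euclidean distance of each consecutive pair of points.
    return sum(math.hypot(x1 - x0, y1 - y0)
               for (x0, y0), (x1, y1) in zip(points, points[1:]))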

Thanks in advance for your answers.


Using numpy will be much faster and much more convenient than your current solution, and far easier than a Cython implementation. It will not give you parallelization, though (but it may well be optimized for your CPU). – Evert


If you intend to go down the multiprocessing path, you will need to split the big list into chunks, process the chunks, and merge the results at the end. – kezzos


Do you think this will improve performance? –

Answers


Here is a simple example to get you started:

from scipy.spatial import distance
from multiprocessing import Pool

processes = 4

# Group the data into consecutive pairs so each distance can be computed independently
pairs = [(points[i], points[i + 1]) for i in range(len(points) - 1)]
print(pairs)

# Split the pairs into chunks (each chunk here holds `processes` pairs)
l = [pairs[i:i + processes] for i in range(0, len(pairs), processes)]

def worker(lst):
    return [distance.euclidean(pair[0], pair[1]) for pair in lst]

if __name__ == "__main__":
    p = Pool(processes)
    result = p.map(worker, l)
    # Flatten the list of per-chunk results
    print([item for sublist in result for item in sublist])

Test this with:

import random
points = [(random.randint(0, 1000), random.randint(0, 1000)) for i in range(1000000)]

With 8 processes it takes about 5 seconds; with 1 process it takes over 10 seconds.
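Note that the list comprehension above produces chunks of size processes (i.e. a great many tiny tasks) rather than processes large chunks. A sketch of splitting the pairs into a fixed number of larger chunks instead, which keeps per-task scheduling overhead down (chunk is an illustrative helper, not part of the original answer):

def chunk(seq, n):
    # Split seq into n roughly equal contiguous chunks.
    k, r = divmod(len(seq), n)
    chunks, start = [], 0
    for i in range(n):
        stop = start + k + (1 if i < r else 0)
        chunks.append(seq[start:stop])
        start = stop
    return chunks

l = chunk(pairs, processes)  # 4 chunks in total instead of len(pairs)/4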


You said Python, but since you are already using scipy for your distance computation I assume a numpy-based solution is acceptable.

A vectorized, single-threaded operation on a numpy array of 28 million points takes only about 1 second on my laptop. With a 32-bit integer dtype, the array occupies roughly 200 MB of memory.

import numpy as np
points = [(26406, -6869), ..., (28259, -892)]
# make a test array by repeating the 28-element points list 1M times
np_points = np.array(points * 1000000, dtype='int32')
# subtract two slices offset by one element to get consecutive differences;
# execution of the next line takes ~1 second
dists = np.sqrt(np.sum((np_points[:-1] - np_points[1:])**2, axis=1))
print(dists.shape)
(27999999,)

print(dists[:28]) 
[ 6.76878372e+03 0.00000000e+00 6.06789865e+03 5.58419672e+03 
    2.00000000e+00 1.13421338e+03 1.64954600e+03 6.69263775e+02 
    1.13421338e+03 5.57000898e+02 2.01545280e+03 6.69263775e+02 
    1.13323343e+03 5.59400572e+02 2.01744244e+03 1.15636197e+03 
    5.60180328e+02 5.32876815e+03 5.30084993e+03 5.59400572e+02 
    2.01953386e+03 1.15689585e+03 5.58213221e+02 5.32679134e+03 
    4.50303153e+03 1.15431581e+03 5.58802291e+02 6.25764636e+03] 
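The same computation can be written more compactly with np.diff, which takes the row-wise differences np_points[1:] - np_points[:-1] and therefore covers each consecutive pair exactly once. A small self-contained sketch:

import numpy as np

pts = np.array([(26406, -6869), (27679, -221), (27679, -221)], dtype='int32')
# np.linalg.norm over axis 1 turns each difference row into a Euclidean distance
dists = np.linalg.norm(np.diff(pts, axis=0), axis=1)
print(dists)        # [6768.78...  0.]
print(dists.sum())  # total path length over all consecutive pairs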

You could combine this with process-level parallelization, but it is unlikely to help: the overhead of copying the data, together with process initialization, is large relative to the amount of work. – jvd10
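For completeness, here is a sketch of how that copying overhead could be avoided on Python 3.8+ using multiprocessing.shared_memory, so that workers attach to a single shared buffer instead of receiving pickled chunks (chunk_dist, the 4-way split, and the random test data are illustrative assumptions, not part of the answer above):

import numpy as np
from multiprocessing import Pool, shared_memory

def chunk_dist(args):
    # Attach to the shared buffer by name and view it as the points array.
    name, shape, dtype, start, stop = args
    shm = shared_memory.SharedMemory(name=name)
    pts = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    # Sum the distances for consecutive pairs (start, start+1) .. (stop-1, stop).
    seg = pts[start:stop + 1].astype('float64')
    total = np.sqrt(((seg[:-1] - seg[1:]) ** 2).sum(axis=1)).sum()
    shm.close()
    return total

if __name__ == "__main__":
    pts = np.random.randint(0, 1000, size=(1000000, 2)).astype('int32')
    shm = shared_memory.SharedMemory(create=True, size=pts.nbytes)
    shared = np.ndarray(pts.shape, dtype=pts.dtype, buffer=shm.buf)
    shared[:] = pts  # one copy in, instead of one pickled copy per task

    n = pts.shape[0]
    bounds = np.linspace(0, n - 1, 5, dtype=int)  # 5 boundaries -> 4 chunks
    jobs = [(shm.name, pts.shape, pts.dtype, bounds[i], bounds[i + 1])
            for i in range(4)]
    with Pool(4) as pool:
        print(sum(pool.map(chunk_dist, jobs)))

    shm.close()
    shm.unlink()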