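Python multiprocessing: check whether memory is shared or being copied

I have to query a large number of vectors against a sklearn KDTree, which is part of a searcher class. I tried to query them in parallel using Python multiprocessing, but the parallel code takes almost the same time as the single-process version (or longer).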
import time
import numpy as np
from sklearn.neighbors import KDTree
from multiprocessing import Pool

def glob_query(arg, **kwarg):
    # Module-level wrapper so Pool.map can pickle the call; arg is (searcher, x).
    return Searcher.query(*arg, **kwarg)

class Searcher:
    def __init__(self, N, D):
        self.kdt = KDTree(np.random.rand(N, D), leaf_size=30, metric="euclidean")

    def query(self, X):
        # KDTree.query expects a 2-D array, so a single point becomes shape (1, D).
        return self.kdt.query(np.atleast_2d(X), k=5, return_distance=False)

    def query_sin(self, X):
        return [self.query(x) for x in X]

    def query_par(self, X):
        # Every task carries `self`, so the tree is pickled along with the work.
        with Pool(4) as p:
            return p.map(glob_query, zip([self] * len(X), X))

if __name__ == "__main__":
    N = 1000000  # Number of points to be indexed
    D = 50       # Dimensions
    searcher = Searcher(N, D)
    E = 100      # Number of points to be searched
    points = np.random.rand(E, D)

    # Works fine
    start = time.time()
    searcher.query_sin(points)
    print("Time taken - %f" % (time.time() - start))

    # Slower than single core
    start = time.time()
    print(searcher.query_par(points))
    print("Time taken - %f" % (time.time() - start))
Time taken - 28.591089
Time taken - 36.920716
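I would like to know:
- whether my KD-tree is being copied into every worker process
- whether there is another way to parallelise the search (using pathos?)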
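One way to sidestep the copying question is to use threads instead of processes. This is a minimal sketch, not a measured fix. It assumes that the tree query spends most of its time in compiled code that releases the GIL, which is worth verifying with a timing run on your own data. The helper name `query_threaded` and the small array sizes are made up for illustration.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from sklearn.neighbors import KDTree

def query_threaded(tree, points, k=5, n_workers=4):
    """Query `tree` for all `points`, one contiguous chunk per thread."""
    # Drop empty chunks so tree.query never sees a zero-row array.
    chunks = [c for c in np.array_split(points, n_workers) if len(c)]
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        # Threads share the parent's memory, so `tree` is not pickled or copied.
        parts = list(ex.map(
            lambda c: tree.query(c, k=k, return_distance=False), chunks))
    return np.vstack(parts)

# Small illustrative sizes (assumption); the question uses 1,000,000 x 50.
rng = np.random.default_rng(0)
tree = KDTree(rng.random((2000, 50)), leaf_size=30, metric="euclidean")
points = rng.random((100, 50))
neighbours = query_threaded(tree, points)  # index array of shape (100, 5)
```

Each thread gets a whole chunk of query points rather than a single vector, so the per-call overhead is paid a handful of times instead of once per point. scikit-learn's `NearestNeighbors(algorithm="kd_tree", n_jobs=...)` exposes a comparable built-in option and may be simpler if you do not need the raw `KDTree`.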
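If I create the pool in '__init__', I get an error saying 'pool objects cannot be passed between processes or pickled' – kampta
@kampta: If you really do end up needing to pass the 'pool', you can do so using 'pathos'... essentially, you can make nested 'map' calls (or 'map' variants). –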
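If you would rather stay with the standard library than switch to 'pathos', the pickling error from the comment above can usually be avoided by leaving the pool out of the object's pickled state. This sketch is a different technique from the one the reply suggests. `PoolSafeSearcher`, `_query_one` and `close` are hypothetical names, and the pool is created lazily here rather than in '__init__'.

```python
import numpy as np
from multiprocessing import Pool
from sklearn.neighbors import KDTree

def _query_one(args):
    # Module-level helper so it can be pickled by reference.
    searcher, x = args
    return searcher.query(x)

class PoolSafeSearcher:  # hypothetical name, for illustration only
    def __init__(self, data, n_workers=4):
        self.kdt = KDTree(data, leaf_size=30, metric="euclidean")
        self.n_workers = n_workers
        self.pool = None  # created lazily on the first parallel query

    def __getstate__(self):
        # Copy the instance dict but blank out the pool, which refuses to be
        # pickled; workers receive the searcher without it.
        state = self.__dict__.copy()
        state["pool"] = None
        return state

    def query(self, X):
        # Promote a single point to shape (1, D) before querying.
        return self.kdt.query(np.atleast_2d(X), k=5, return_distance=False)

    def query_par(self, X):
        if self.pool is None:
            self.pool = Pool(self.n_workers)
        return self.pool.map(_query_one, [(self, x) for x in X])

    def close(self):
        # Shut the worker processes down once the searcher is no longer needed.
        if self.pool is not None:
            self.pool.close()
            self.pool.join()
            self.pool = None
```

This only removes the error. Each batch of tasks still carries the searcher, and therefore the tree, to the workers, so on its own it is unlikely to make the parallel version faster.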