2017-04-04 222 views
1

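Python: comparing elements of two arrays

I want to compare the elements of two numpy arrays and delete elements from one of them if the Euclidean distance between the coordinates is less than 1 and the time is the same. data_CD4 and data_CD8 are the arrays. Each element is a 3D coordinate with the time as the fourth entry (numpy.array([[x,y,z,time],[x,y,z,time] .....])). co is the cutoff, here 1.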

for i in data_CD8:
    for m in data_CD4:
        if distance.euclidean(tuple(i[:3]), tuple(m[:3])) < co and i[3] == m[3]:
            data_CD8 = np.delete(data_CD8, i, 0)

Is there a faster way to do this? The first array has 5000 elements and the second 2000, so it takes too much time.

+0

That should be `[:3]`, not `[3:]`. – trincot

+0

If you like, you can also do the comparison with numpy; see: http://stackoverflow.com/questions/10580676/comparing-two-numpy-arrays-for-equality-element-wise – LethalProgrammer

+0

As @trincot pointed out, it has to be `distance.euclidean(tuple(i[:3]), tuple(m[:3]))`. Can you confirm? – Divakar

Answers

2

Here's a vectorized approach using SciPy's cdist:

from scipy.spatial import distance 

# Get Euclidean distances between the first three columns of data_CD8 and data_CD4
dists = distance.cdist(data_CD8[:,:3], data_CD4[:,:3]) 

# Get mask of those distances that are within co distance. This sets up the 
# first condition requirement as posted in the loopy version of original code. 
mask1 = dists < co 

# Take the fourth column (index 3) of the two input arrays, which holds the time values.
# Compare every time value in data_CD8 against every time value in data_CD4.
# This sets up the second condition.
# Adding a new axis with None lets NumPy broadcasting perform
# these comparisons in a vectorized manner.
mask2 = data_CD8[:,3,None] == data_CD4[:,3] 

# Combine the two masks and look for any match corresponding to any
# element of data_CD4. Since the masks are set up such that the second axis
# represents data_CD4, we need numpy.any along axis=1 on the combined mask.
# A final inversion of mask is needed as we are deleting the ones that 
# satisfy these requirements. 
mask3 = ~((mask1 & mask2).any(1)) 

# Finally, using boolean indexing to select the valid rows off data_CD8 
out = data_CD8[mask3] 
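
To make the role of each mask concrete, here is a minimal sketch that runs the same steps on a toy input. The array values and the cutoff are made up for illustration; only the variable names and the sequence of operations come from the answer above.

```python
import numpy as np
from scipy.spatial import distance

# Toy data (hypothetical values): each row is [x, y, z, time]
data_CD8 = np.array([[0.0, 0.0, 0.0, 1.0],
                     [5.0, 5.0, 5.0, 1.0],
                     [0.0, 0.0, 0.5, 2.0]])
data_CD4 = np.array([[0.1, 0.0, 0.0, 1.0],
                     [0.0, 0.0, 0.4, 3.0]])
co = 1.0  # cutoff distance

# Pairwise distances: rows follow data_CD8, columns follow data_CD4
dists = distance.cdist(data_CD8[:, :3], data_CD4[:, :3])  # shape (3, 2)
mask1 = dists < co                                        # close enough?
mask2 = data_CD8[:, 3, None] == data_CD4[:, 3]            # same time?

# A data_CD8 row is dropped if ANY data_CD4 row is both close and simultaneous
mask3 = ~((mask1 & mask2).any(1))                         # shape (3,)
out = data_CD8[mask3]

print(mask3)  # [False  True  True]
print(out)    # rows 1 and 2 of data_CD8 survive
```

Only the first row is removed: it is within the cutoff of a data_CD4 point recorded at the same time. The third row is also close to data_CD4 points, but none of them share its time value, so it is kept. Note that `out` is a new array; `data_CD8` itself is unchanged unless the result is assigned back.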
+0

Hmm, when I try your code, nothing gets deleted from the array. With my code, half of the elements in data_CD8 are deleted. Right now I can't say why. – Varlor

+0

@Varlor It creates a new array with the deletions applied, `data_CD8_out`. Did you check the values in that array? Or just assign it back with `data_CD8 = data_CD8[~((mask1 & mask2).any(1))]`? – Divakar

+0

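So data_CD8_out is the original array without the elements that satisfy the condition? Could you explain your code? It seems very fast and I'd like to understand it :) – Varlor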

0

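If you have to compare every item in data_CD4 against the items in data_CD8 while deleting from data_CD8, it may be better to make the inner iteration smaller on each pass. This of course depends on your most common case.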

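for m in data_CD4:
    # Flag the data_CD8 rows that survive the comparison against this data_CD4 row
    keep = [not (distance.euclidean(i[:3], m[:3]) < co and i[3] == m[3])
            for i in data_CD8]
    # Shrink data_CD8 so that later passes have fewer rows to check
    data_CD8 = data_CD8[np.array(keep, dtype=bool)]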

In terms of big-O notation, and since this is O(n^2), I don't see a faster solution.
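
The quadratic bound applies when every pair is compared. As a hedged aside, a spatial index can skip most pairs when the points are spread out. The sketch below is not from this thread: it swaps in SciPy's `cKDTree`, groups rows by time value, and asks for the nearest data_CD4 neighbour of each data_CD8 row. The function name `drop_close_same_time` is hypothetical, and the approach assumes time values can be matched by exact equality, as in the question.

```python
import numpy as np
from scipy.spatial import cKDTree

def drop_close_same_time(data_CD8, data_CD4, co):
    """Return the rows of data_CD8 that have no data_CD4 row with the
    same time value closer than co (rows are [x, y, z, time])."""
    keep = np.ones(len(data_CD8), dtype=bool)
    for t in np.unique(data_CD4[:, 3]):
        # Indices of data_CD8 rows recorded at this time value
        idx8 = np.flatnonzero(data_CD8[:, 3] == t)
        if idx8.size == 0:
            continue
        # Build a tree over the data_CD4 coordinates at this time value
        tree = cKDTree(data_CD4[data_CD4[:, 3] == t, :3])
        # Distance from each data_CD8 point to its nearest data_CD4 point
        nearest, _ = tree.query(data_CD8[idx8, :3], k=1)
        keep[idx8[nearest < co]] = False
    return data_CD8[keep]

# Toy input (hypothetical values)
cd8 = np.array([[0.0, 0.0, 0.0, 1.0],
                [5.0, 5.0, 5.0, 1.0],
                [0.0, 0.0, 0.5, 2.0]])
cd4 = np.array([[0.1, 0.0, 0.0, 1.0],
                [0.0, 0.0, 0.4, 3.0]])
filtered = drop_close_same_time(cd8, cd4, co=1.0)
```

Whether this is actually faster than a full pairwise comparison depends on the data: with a few thousand rows the vectorized all-pairs versions may well win, so it is worth timing both on the real arrays.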

2

This should be a vectorized approach.

mask1 = np.sum((data_CD4[:, None, :3] - data_CD8[None, :, :3])**2, axis = -1) < co**2 
mask2 = data_CD4[:, None, 3] == data_CD8[None, :, 3] 
mask3 = np.any(np.logical_and(mask1, mask2), axis = 0) 
data_CD8 = data_CD8[~mask3] 

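mask1 should speed up the distance computation because it needs no square-root call. mask1 and mask2 are 2D arrays that we squeeze down to 1D with np.any. Doing all the deletions at the end avoids a bunch of read/write operations.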
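
The axis bookkeeping is the easy part to get wrong here, so the following sketch spells out the shapes on small random input and checks the result against the cdist formulation. The data, sizes, and cutoff are invented for illustration; integer-valued coordinates are used so that comparing squared distances and comparing distances cannot disagree through rounding.

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
# Hypothetical toy data: integer-valued [x, y, z, time] rows stored as floats
data_CD4 = rng.integers(0, 4, size=(6, 4)).astype(float)
data_CD8 = rng.integers(0, 4, size=(8, 4)).astype(float)
co = 1.5

# data_CD4 runs along axis 0, data_CD8 along axis 1
diff = data_CD4[:, None, :3] - data_CD8[None, :, :3]   # shape (6, 8, 3)
mask1 = np.sum(diff**2, axis=-1) < co**2               # shape (6, 8)
mask2 = data_CD4[:, None, 3] == data_CD8[None, :, 3]   # shape (6, 8)

# Reduce over the data_CD4 axis so one flag remains per data_CD8 row
mask3 = np.any(np.logical_and(mask1, mask2), axis=0)   # shape (8,)
result_broadcast = data_CD8[~mask3]

# Same selection via cdist, with data_CD8 along axis 0 instead
close = distance.cdist(data_CD8[:, :3], data_CD4[:, :3]) < co
same_time = data_CD8[:, 3, None] == data_CD4[:, 3]
result_cdist = data_CD8[~(close & same_time).any(1)]

print(np.array_equal(result_broadcast, result_cdist))
```

The final mask has to have one entry per row of data_CD8, which is why the reduction runs over axis 0 in this layout and over axis 1 in the cdist layout. The intermediate `diff` array holds one 3-vector per pair, so for 5000 by 2000 rows it takes noticeably more memory than the distance matrix alone.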

Speed test:

a = np.random.randint(0, 10, (100, 3)) 

b = np.random.randint(0, 10, (100, 3)) 

%timeit cdist(a,b) < 5 #Divakar's answer 
10000 loops, best of 3: 133 µs per loop 

%timeit np.sum((a[None, :, :] - b[:, None, :]) ** 2, axis = -1) < 25 # My answer 
1000 loops, best of 3: 418 µs per loop 

And the C-compiled code wins, even with the unnecessary square root.

+0

Thanks for the effort. I get this error when trying the code: IndexError: index 3284 is out of bounds for axis 0 with size 2587 – Varlor

+0

Hard to say what the error is, but try `axis = 0` in `mask3` –

+0

Aaand, just like in Divakar's answer, you need to invert `mask3` –