方法1:我们可以简单地使用Scipy's cdist
其cosine
距离的功能 -
from scipy.spatial.distance import cdist
val_out = 1 - cdist(array1, array2, 'cosine')
方法2:使用matrix-multiplication
另一种方法 -
def cosine_vectorized(array1, array2):
sumyy = (array2**2).sum(1)
sumxx = (array1**2).sum(1, keepdims=1)
sumxy = array1.dot(array2.T)
return (sumxy/np.sqrt(sumxx))/np.sqrt(sumyy)
方法#3 :使用np.einsum
来计算自平方su对于另一个mmations -
def cosine_vectorized_v2(array1, array2):
sumyy = np.einsum('ij,ij->i',array2,array2)
sumxx = np.einsum('ij,ij->i',array1,array1)[:,None]
sumxy = array1.dot(array2.T)
return (sumxy/np.sqrt(sumxx))/np.sqrt(sumyy)
方法#4:在numexpr
module瞻来卸载square-root
计算为另一种方法 -
import numexpr as ne
def cosine_vectorized_v3(array1, array2):
sumyy = np.einsum('ij,ij->i',array2,array2)
sumxx = np.einsum('ij,ij->i',array1,array1)[:,None]
sumxy = array1.dot(array2.T)
sqrt_sumxx = ne.evaluate('sqrt(sumxx)')
sqrt_sumyy = ne.evaluate('sqrt(sumyy)')
return ne.evaluate('(sumxy/sqrt_sumxx)/sqrt_sumyy')
运行测试
# Using same sizes as stated in the question
In [185]: array1 = np.random.rand(10000,100)
...: array2 = np.random.rand(5000,100)
...:
In [194]: %timeit 1 - cdist(array1, array2, 'cosine')
1 loops, best of 3: 366 ms per loop
In [195]: %timeit cosine_vectorized(array1, array2)
1 loops, best of 3: 287 ms per loop
In [196]: %timeit cosine_vectorized_v2(array1, array2)
1 loops, best of 3: 283 ms per loop
In [197]: %timeit cosine_vectorized_v3(array1, array2)
1 loops, best of 3: 217 ms per loop
'fn'计算两个向量之间的余弦相似度。我更新了这个问题 –