2017-04-19 110 views

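Cosine distance computation between two arrays - Python

I want to apply a function fn, which is essentially a cosine distance calculation, row-wise to two large NumPy arrays of shapes (10000, 100) and (5000, 100). That is, I compute one value for every combination of a row from the first array and a row from the second.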

My implementation:

import math

def fn(v1, v2):
    # Cosine similarity of two 1-D vectors, accumulated element by element.
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x, y = v1[i], v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx*sumyy)

# array1 has shape (10000, 100) and array2 has shape (5000, 100).
val = []
for i in range(array1.shape[0]):
    for j in range(array2.shape[0]):
        val.append(fn(array1[i, :], array2[j, :]))

The function itself is very fast, taking only a few milliseconds per call:

CPU times: user 4 ms, sys: 0 ns, total: 4 ms 
Wall time: 1.24 ms 

However, the nested loop calls fn 10000 × 5000 = 50 million times. Is there a more efficient way to do this?


'fn' computes the cosine similarity between two vectors. I have updated the question.

Answer


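Approach #1: We can simply use SciPy's cdist with its cosine distance metric. cdist returns the cosine distance, so subtracting it from 1 gives the cosine similarity that fn computes: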

from scipy.spatial.distance import cdist 

val_out = 1 - cdist(array1, array2, 'cosine') 
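As a quick sanity check, the sketch below compares 1 - cdist(...) against the loop-based fn from the question on small random arrays. The array sizes, the seed, and the names small1, small2, loop_vals and cdist_vals are illustrative assumptions, not part of the original answer.

```python
import math

import numpy as np
from scipy.spatial.distance import cdist


def fn(v1, v2):
    # Loop-based cosine similarity, as defined in the question.
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x, y = v1[i], v2[i]
        sumxx += x * x
        sumyy += y * y
        sumxy += x * y
    return sumxy / math.sqrt(sumxx * sumyy)


# Small illustrative inputs (assumed sizes; the question uses 10000 and 5000 rows).
rng = np.random.RandomState(0)
small1 = rng.rand(6, 100)
small2 = rng.rand(4, 100)

# Reference values from the nested loop, arranged as (rows of small1, rows of small2).
loop_vals = np.array([[fn(r1, r2) for r2 in small2] for r1 in small1])

# cdist returns cosine *distance*; 1 - distance is the cosine similarity.
cdist_vals = 1 - cdist(small1, small2, 'cosine')

print(cdist_vals.shape)
print(np.allclose(loop_vals, cdist_vals))
```

The result is a 2-D array with one row per row of the first input, whereas the question builds a flat list; calling .ravel() on the 2-D result gives the values in the same order as the nested loop.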

Approach #2: Another approach, using matrix multiplication:

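import numpy as np

def cosine_vectorized(array1, array2):
    sumyy = (array2**2).sum(1)                 # squared row norms of array2, shape (n,)
    sumxx = (array1**2).sum(1, keepdims=True)  # squared row norms of array1, shape (m, 1)
    sumxy = array1.dot(array2.T)               # all pairwise dot products, shape (m, n)
    return (sumxy/np.sqrt(sumxx))/np.sqrt(sumyy)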
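The function above relies on broadcasting: sumxx is kept as a column and sumyy as a flat vector, so dividing the matrix of dot products by them scales each row and each column by the matching norm. Here is a minimal sketch of the shapes involved, using two tiny made-up arrays a and b (illustrative names and values, not from the original answer):

```python
import numpy as np

# Tiny illustrative inputs: 3 rows and 2 rows, each of length 4.
a = np.array([[1.0, 0.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 0.0]])
b = np.array([[2.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])

sumxx = (a**2).sum(1, keepdims=True)  # shape (3, 1): squared norm of each row of a
sumyy = (b**2).sum(1)                 # shape (2,):   squared norm of each row of b
sumxy = a.dot(b.T)                    # shape (3, 2): all pairwise dot products

# (3, 2) / (3, 1) divides row i by |a_i|; dividing by (2,) then divides column j by |b_j|.
sim = (sumxy / np.sqrt(sumxx)) / np.sqrt(sumyy)

print(sumxx.shape, sumyy.shape, sumxy.shape, sim.shape)
print(sim)
```

Rows that point in the same direction get a similarity of 1 regardless of their length, and orthogonal rows get 0.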

Approach #3: Another approach, using np.einsum to compute the sums of squares of each row:

def cosine_vectorized_v2(array1, array2): 
    sumyy = np.einsum('ij,ij->i',array2,array2) 
    sumxx = np.einsum('ij,ij->i',array1,array1)[:,None] 
    sumxy = array1.dot(array2.T) 
    return (sumxy/np.sqrt(sumxx))/np.sqrt(sumyy) 
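The subscript string 'ij,ij->i' multiplies the two inputs elementwise and sums over j, so passing the same array twice yields each row's sum of squares. The sketch below checks that against the (x**2).sum(1) form from Approach #2; the array x, its size, and the seed are illustrative assumptions.

```python
import numpy as np

# Illustrative input (assumed size).
rng = np.random.RandomState(0)
x = rng.rand(5, 100)

# Row-wise sum of squares via einsum: multiply x by itself, sum over axis j.
row_sq_einsum = np.einsum('ij,ij->i', x, x)

# The same quantity computed by squaring first and then summing each row.
row_sq_plain = (x**2).sum(1)

# [:, None] turns the flat result into a column so it broadcasts against rows.
col = row_sq_einsum[:, None]

print(row_sq_einsum.shape, col.shape)
print(np.allclose(row_sq_einsum, row_sq_plain))
```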

Approach #4: Another approach, bringing in the numexpr module to offload the square-root computations:

import numexpr as ne 

def cosine_vectorized_v3(array1, array2): 
    sumyy = np.einsum('ij,ij->i',array2,array2) 
    sumxx = np.einsum('ij,ij->i',array1,array1)[:,None] 
    sumxy = array1.dot(array2.T) 
    sqrt_sumxx = ne.evaluate('sqrt(sumxx)') 
    sqrt_sumyy = ne.evaluate('sqrt(sumyy)') 
    return ne.evaluate('(sumxy/sqrt_sumxx)/sqrt_sumyy') 

Runtime test

# Using same sizes as stated in the question 
In [185]: array1 = np.random.rand(10000,100) 
    ...: array2 = np.random.rand(5000,100) 
    ...: 

In [194]: %timeit 1 - cdist(array1, array2, 'cosine') 
1 loops, best of 3: 366 ms per loop 

In [195]: %timeit cosine_vectorized(array1, array2) 
1 loops, best of 3: 287 ms per loop 

In [196]: %timeit cosine_vectorized_v2(array1, array2) 
1 loops, best of 3: 283 ms per loop 

In [197]: %timeit cosine_vectorized_v3(array1, array2) 
1 loops, best of 3: 217 ms per loop
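To repeat this kind of comparison outside IPython, the standard timeit module can stand in for %timeit. The sketch below is one possible harness, not the author's benchmark: it uses smaller arrays (an assumption, to keep the run short), redefines the two NumPy-only approaches so it is self-contained, and first checks that they agree. The names a1, a2, out_v1, out_v2 and candidates are hypothetical.

```python
import timeit

import numpy as np


def cosine_vectorized(array1, array2):
    sumyy = (array2**2).sum(1)
    sumxx = (array1**2).sum(1, keepdims=True)
    sumxy = array1.dot(array2.T)
    return (sumxy / np.sqrt(sumxx)) / np.sqrt(sumyy)


def cosine_vectorized_v2(array1, array2):
    sumyy = np.einsum('ij,ij->i', array2, array2)
    sumxx = np.einsum('ij,ij->i', array1, array1)[:, None]
    sumxy = array1.dot(array2.T)
    return (sumxy / np.sqrt(sumxx)) / np.sqrt(sumyy)


# Smaller than the question's (10000, 100) and (5000, 100) to keep the run short.
rng = np.random.RandomState(0)
a1 = rng.rand(1000, 100)
a2 = rng.rand(500, 100)

# Check that both approaches produce the same matrix before timing them.
out_v1 = cosine_vectorized(a1, a2)
out_v2 = cosine_vectorized_v2(a1, a2)
print(np.allclose(out_v1, out_v2))

candidates = [('cosine_vectorized', cosine_vectorized),
              ('cosine_vectorized_v2', cosine_vectorized_v2)]
for name, func in candidates:
    # Best of 3 single runs, mirroring the "best of 3" reported by %timeit.
    best = min(timeit.repeat(lambda: func(a1, a2), number=1, repeat=3))
    print('%s: %.1f ms' % (name, best * 1e3))
```

Timings will differ from the figures above depending on hardware and the BLAS library NumPy is linked against. At the question's full sizes the result holds 50 million float64 values, about 400 MB, so memory is also worth keeping in mind.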