2016-09-07

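Numpy cosine similarity differences on large sets

I need to use scikit-learn's sklearn.metrics.pairwise.cosine_similarity on a large matrix. As an optimization I only need to compute some rows of the result, so I tried different approaches.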

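I found that in some cases the results differ depending on the size of the vectors. I see this strange behavior in the following test case (large vectors, transposed, then cosine similarity computed):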

from sklearn.metrics.pairwise import cosine_similarity 
from scipy import spatial 
import numpy as np 
from scipy.sparse import csc_matrix 

size=200 
a=np.array([[1,0,1,0]]*size) 
sparse_a=csc_matrix(a.T) 
#standard cosine similarity between the whole transposed matrix, take only the first row 
res1=cosine_similarity(a.T,a.T)[0] 
#take the row obtained by the multiplication of the first row of the transposed matrix with transposed matrix itself (optimized for the first row calculus only) 
res2=cosine_similarity([a.T[0]],a.T)[0] 
#sparse matrix implementation with the transposed, which should be faster 
res3=cosine_similarity(sparse_a,sparse_a)[0] 
print("res1: ",res1) 
print("res2: ",res2) 
print("res3: ",res3) 
print("res1 vs res2: ",res1==res2) 
print("res1 vs res3: ",res1==res3) 
print("res2 vs res3: ", res2==res3) 

If "size" is set as in the code above (200), I get this result, which is fine:

res1: [ 1. 0. 1. 0.] 
res2: [ 1. 0. 1. 0.] 
res3: [ 1. 0. 1. 0.] 
res1 vs res2: [ True True True True] 
res1 vs res3: [ True True True True] 
res2 vs res3: [ True True True True] 

But if "size" is set to a larger value, something strange happens:

res1: [ 1. 0. 1. 0.] 
res2: [ 1. 0. 1. 0.] 
res3: [ 1. 0. 1. 0.] 
res1 vs res2: [False True False True] 
res1 vs res3: [False True False True] 
res2 vs res3: [ True True True True] 

Does anyone know what I am missing?

Thanks in advance.

Answer


To compare numpy.array values you should use np.isclose rather than the equality operator. Try:

from sklearn.metrics.pairwise import cosine_similarity 
from scipy import spatial 
import numpy as np 
from scipy.sparse import csc_matrix 

size=2000 
a=np.array([[1,0,1,0]]*size) 
sparse_a=csc_matrix(a.T) 
#standard cosine similarity between the whole transposed matrix, take only the first row 
res1=cosine_similarity(a.T,a.T)[0] 
#take the row obtained by the multiplication of the first row of the transposed matrix with transposed matrix itself (optimized for the first row calculus only)
res2=cosine_similarity([a.T[0]],a.T)[0]
#sparse matrix implementation with the transposed, which should be faster
res3=cosine_similarity(sparse_a,sparse_a)[0] 
print("res1: ",res1) 
print("res2: ",res2) 
print("res3: ",res3) 
print("res1 vs res2: ", np.isclose(res1, res2)) 
print("res1 vs res3: ", np.isclose(res1, res3)) 
print("res2 vs res3: ", np.isclose(res2, res3))

The result is:

res1: [ 1. 0. 1. 0.] 
res2: [ 1. 0. 1. 0.] 
res3: [ 1. 0. 1. 0.] 
res1 vs res2: [ True True True True] 
res1 vs res3: [ True True True True] 
res2 vs res3: [ True True True True] 

as expected.
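As a small illustration of why an exact `==` comparison on floating-point arrays is fragile while `np.isclose` is not, here is a minimal sketch. The numbers are arbitrary examples, not values from the question, and the `rtol`/`atol` arguments simply spell out NumPy's documented defaults.

```python
import numpy as np

# Two ways of writing "the same" numbers; the first pair differs in the last bits.
x = np.array([0.1 + 0.2, 1.0])
y = np.array([0.3, 1.0])

exact = x == y                 # element-wise, bit-exact comparison
close = np.isclose(x, y)       # element-wise: |x - y| <= atol + rtol * |y|
# The same check with the default tolerances written out explicitly.
close_explicit = np.isclose(x, y, rtol=1e-05, atol=1e-08)
all_close = np.allclose(x, y)  # single bool: True only if every element is close

print("exact:    ", exact)
print("close:    ", close)
print("all close:", all_close)
```

`np.allclose` is convenient when a single yes/no answer is wanted for the whole array.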


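Thank you very much for your answer; I ran it and it works. But according to the documentation, np.isclose() *"Returns a boolean array where two arrays are element-wise equal within a tolerance."* That seems to confirm that the values in the arrays are not exactly the same (they are only close to each other, within the tolerance). The crux of my question is **why cosine_similarity returns different values in different cases**.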


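'cosine_similarity' does not return different values in different cases. It always returns '[1. 0. 1. 0.]'. The problem is the way the results are compared: a 'numpy.array' of floats should not be compared with '=='.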
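One way to check whether two results are bit-identical or merely print the same is to look at the differences directly rather than at the rounded printout. The sketch below is only an illustration in plain NumPy and does not reproduce scikit-learn's internals: it assumes the cosine of a vector of ones with itself is obtained as the dot product of the normalized vector, and evaluates that dot product with three different summation orders. Whether the three results differ in the last bits depends on the platform and the underlying BLAS library, so no specific output is claimed here.

```python
import math
import numpy as np

size = 2000
v = np.ones(size)
unit = v / np.linalg.norm(v)   # every entry is 1/sqrt(size)

# Mathematically each result equals 1.0; only the order of summation differs.
dot_np = float(np.dot(unit, unit))    # np.dot (typically backed by BLAS)

dot_loop = 0.0
for value in unit:                    # naive left-to-right accumulation
    dot_loop += value * value

dot_fsum = math.fsum(unit * unit)     # correctly rounded sum of the products

results = np.array([dot_np, dot_loop, dot_fsum])

# Default printing rounds to a few digits, so tiny differences are invisible.
print(results)
# Deviations from 1.0 make any last-bit differences visible.
print(results - 1.0)
print("bit-identical:", results == results[0])
print("close:        ", np.isclose(results, 1.0))
```

Applied to the arrays from the question, printing `res1 - res2` in the same way would show whether the `False` entries come from differences of this size.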