Cosine similarity using TFIDF

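There are several questions on SO and around the web describing how to take the cosine similarity between two strings, and even between two strings with TFIDF as weights. But the output of a function like scikit's linear_kernel confuses me a little.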

Consider the following code:

import pandas as pd 
from sklearn.feature_extraction.text import TfidfVectorizer 

a = ['hello world', 'my name is', 'what is your name?'] 
b = ['my name is', 'hello world', 'my name is what?'] 

df = pd.DataFrame(data={'a':a, 'b':b}) 
df['ab'] = df.apply(lambda x : x['a'] + ' ' + x['b'], axis=1) 
print(df.head()) 

                    a                 b                                   ab
0         hello world        my name is               hello world my name is
1          my name is       hello world               my name is hello world
2  what is your name?  my name is what?  what is your name? my name is what?

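Question: I would like a column that holds the cosine similarity between the string in a and the string in b.

What I tried:

I fitted a TFIDF vectorizer on ab, so that it includes all the words: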

clf = TfidfVectorizer(ngram_range=(1, 1), stop_words='english') 
clf.fit(df['ab'])

Then I got the sparse TFIDF matrices for the a and b columns:

tfidf_a = clf.transform(df['a']) 
tfidf_b = clf.transform(df['b'])

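Now, if I use scikit's linear_kernel, which others have recommended, I get a matrix of shape (n_samples, n_samples), with one row per entry of a and one column per entry of b, as their documentation describes.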

from sklearn.metrics.pairwise import linear_kernel 
linear_kernel(tfidf_a,tfidf_b) 

array([[ 0., 1., 0.], 
     [ 0., 0., 0.], 
     [ 0., 0., 0.]])

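But what I need is a simple vector where the first element is the cos_sim between the first row of a and the first row of b, the second element is cos_sim(a[1], b[1]), and so on.

Using Python 3, scikit-learn 0.17.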

Source

2016-04-21 David

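I think your example falls down a bit because your TfidfVectorizer filters out most of your vocabulary: you pass stop_words='english', and almost every word in your example is a stop word. I have removed that parameter and made your matrices dense so we can see what is happening. What if you did something like this?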

import pandas as pd 
from sklearn.feature_extraction.text import TfidfVectorizer 
from scipy import spatial 

a = ['hello world', 'my name is', 'what is your name?'] 
b = ['my name is', 'hello world', 'my name is what?'] 

df = pd.DataFrame(data={'a':a, 'b':b}) 
df['ab'] = df.apply(lambda x : x['a'] + ' ' + x['b'], axis=1) 

clf = TfidfVectorizer(ngram_range=(1, 1)) 
clf.fit(df['ab']) 

# .toarray() yields plain ndarrays, so each row is the 1-D vector that cosine() expects
tfidf_a = clf.transform(df['a']).toarray()
tfidf_b = clf.transform(df['b']).toarray()

row_similarities = [1 - spatial.distance.cosine(tfidf_a[x], tfidf_b[x]) for x in range(len(tfidf_a))]
row_similarities 

[0.0, 0.0, 0.72252389079716417]

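This gives you the cosine similarity between each pair of corresponding rows (one minus the cosine distance). I'm not fully on board with how you are building the full corpus, but the example isn't optimized at all, so I'll leave it for now. Hope this helps.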
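
As a loop-free alternative, here is a minimal sketch. It assumes TfidfVectorizer's default norm='l2', so every non-empty row already has unit length and the cosine similarity of two rows is just their dot product. The sparse matrices are multiplied elementwise and summed per row. The same numbers sit on the diagonal of linear_kernel(tfidf_a, tfidf_b), because entry [i, j] of that matrix compares a[i] with b[j]. The names row_sims, pairwise and cos_sim are mine, not from the question.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

a = ['hello world', 'my name is', 'what is your name?']
b = ['my name is', 'hello world', 'my name is what?']
df = pd.DataFrame(data={'a': a, 'b': b})

# Fit on the row-wise concatenation, as in the question and the answer above.
clf = TfidfVectorizer(ngram_range=(1, 1))
clf.fit(df['a'] + ' ' + df['b'])

tfidf_a = clf.transform(df['a'])  # sparse; rows are L2-normalised by default
tfidf_b = clf.transform(df['b'])

# Elementwise product summed per row = dot product of matching rows.
row_sims = np.asarray(tfidf_a.multiply(tfidf_b).sum(axis=1)).ravel()

# linear_kernel computes all pairs; the row-wise values are its diagonal.
pairwise = linear_kernel(tfidf_a, tfidf_b)

df['cos_sim'] = row_sims
print(df[['a', 'b', 'cos_sim']])
```

This keeps everything sparse, which may matter for larger vocabularies. If the vectorizer were created with norm=None, the dot product would no longer be a cosine similarity and the rows would need normalising first.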

Source

2016-04-23 16:14:29 flyingmeatball

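Thanks, this works. Why aren't you on board with how I build the full corpus? – David

Because there are usually better ways than .apply for this type of task. You have 6 strings here, 3 rows in two columns. Are there two separate documents (a and b), or are there 3 documents (one per row)? This matters when calculating the frequencies in TFIDF, and I'm not sure the way you are constructing ab reflects your intent right now. – flyingmeatball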
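
To illustrate the point about what counts as a document, the following sketch fits the same vectorizer on two different corpora: the three concatenated rows (as in the question) and the six individual strings. The names fit_on_rows, fit_on_cells and the column labels are hypothetical. Document frequencies are counted per document, so the learned idf_ weights differ between the two fits, and the resulting similarities would too.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

a = ['hello world', 'my name is', 'what is your name?']
b = ['my name is', 'hello world', 'my name is what?']
df = pd.DataFrame(data={'a': a, 'b': b})

# Corpus 1: three documents, one per row (a and b concatenated).
fit_on_rows = TfidfVectorizer().fit(df['a'] + ' ' + df['b'])

# Corpus 2: six documents, every cell is its own document.
fit_on_cells = TfidfVectorizer().fit(pd.concat([df['a'], df['b']]))

# Both corpora contain the same words, so the vocabularies (and their
# column order) match, but the idf weights do not.
terms = sorted(fit_on_rows.vocabulary_, key=fit_on_rows.vocabulary_.get)
idf = pd.DataFrame(
    {'rows_as_docs': fit_on_rows.idf_, 'cells_as_docs': fit_on_cells.idf_},
    index=terms,
)
print(idf)
```

Which corpus is appropriate depends on what a "document" means for the task, and that is a modelling decision the code cannot make for you.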

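import operator
from math import log10, sqrt

# Module-level state shared by the functions below.
dfs = {}                # token -> number of documents containing it
idfs = {}
speeches = {}
speechvecs = {}         # filename -> tf-idf vector (dict), used by query()
total_word_counts = {}  # token -> total occurrences across documents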

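def tokenize(doc):
    # mytokenizer, stemmer and sortedstopwords are not defined in this post;
    # they must be set up elsewhere before tokenize() is called.
    tokens = mytokenizer.tokenize(doc)
    lowertokens = [token.lower() for token in tokens]
    filteredtokens = [stemmer.stem(token) for token in lowertokens if token not in sortedstopwords]
    return filteredtokens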

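def incdfs(tfvec):
    # tfvec maps token -> count for one document; update the document
    # frequency and the running total count of every token in it.
    for token in set(tfvec):
        if token not in dfs:
            dfs[token] = 1
            total_word_counts[token] = tfvec[token]
        else:
            dfs[token] += 1
            total_word_counts[token] += tfvec[token]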


def calctfidfvec(tfvec, withidf):
    # Build a log-scaled tf (optionally times idf) vector and L2-normalise it.
    # getidf() is not defined in this post.
    tfidfvec = {}
    veclen = 0.0

    for token in tfvec:
        if withidf:
            tfidf = (1 + log10(tfvec[token])) * getidf(token)
        else:
            tfidf = 1 + log10(tfvec[token])
        tfidfvec[token] = tfidf
        veclen += pow(tfidf, 2)

    if veclen > 0:
        for token in tfvec:
            tfidfvec[token] /= sqrt(veclen)

    return tfidfvec

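def cosinesim(vec1, vec2):
    # Dot product over shared tokens; this equals the cosine similarity
    # when both vectors are already L2-normalised.
    commonterms = set(vec1).intersection(vec2)
    sim = 0.0
    for token in commonterms:
        sim += vec1[token] * vec2[token]

    return sim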

def query(qstring):
    # Return the filename whose tf-idf vector is most similar to the query.
    # getqvec() and gettfidfvec() are not defined in this post.
    qvec = getqvec(qstring.lower())
    scores = {filename: cosinesim(qvec, tfidfvec) for filename, tfidfvec in speechvecs.items()}
    return max(scores.items(), key=operator.itemgetter(1))[0]

def docdocsim(filename1, filename2):
    return cosinesim(gettfidfvec(filename1), gettfidfvec(filename2))
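
The code above relies on helpers that the post does not show, so here is a self-contained sketch of the same dictionary-based idea applied to the strings from the question. Several parts are assumptions of mine rather than the answer's method: the regex tokenizer (no stemming, no stop words), the idf formula log10(N / df), the choice to treat each of the six strings as one document, and the names build_idf, tfidf_vector and cosine_sim.

```python
import re
from collections import Counter
from math import log10, sqrt

def tokenize(doc):
    # Assumption: plain lowercase word tokens; the answer's stemmer and
    # stop-word list are not shown, so they are left out here.
    return re.findall(r"\w+", doc.lower())

def build_idf(docs):
    # Assumption: idf = log10(N / df); the answer's getidf() is not shown.
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(tokenize(doc)))
    return {token: log10(n_docs / freq) for token, freq in doc_freq.items()}

def tfidf_vector(doc, idf):
    # Log-scaled term frequency times idf, then L2 normalisation.
    counts = Counter(tokenize(doc))
    vec = {t: (1 + log10(c)) * idf.get(t, 0.0) for t, c in counts.items()}
    length = sqrt(sum(w * w for w in vec.values()))
    return {t: w / length for t, w in vec.items()} if length > 0 else vec

def cosine_sim(vec1, vec2):
    # Vectors are unit length, so the dot product over shared terms suffices.
    return sum(vec1[t] * vec2[t] for t in set(vec1) & set(vec2))

a = ['hello world', 'my name is', 'what is your name?']
b = ['my name is', 'hello world', 'my name is what?']

idf = build_idf(a + b)  # assumption: each of the six strings is a document
row_sims = [cosine_sim(tfidf_vector(x, idf), tfidf_vector(y, idf))
            for x, y in zip(a, b)]
print(row_sims)
```

Because the weighting differs from scikit-learn's defaults (natural log with smoothing), the numbers will not match the TfidfVectorizer output exactly, although rows with no shared words still score zero.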

Source

2016-10-20 02:39:00

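Although this code may solve the problem, it does not explain why or how it answers the question. Please add an explanation for your code (//meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers), as that really helps to improve the quality of the post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion.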
