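Finding the similarity between two documents

Is there a built-in algorithm in Lucene for finding the similarity between two documents? When I went through the default Similarity class, I found that it gives a score as the result of comparing a query with a document.

I have already indexed my documents using the Snowball analyzer, and the next step is to find the similarity between two documents.

Can anyone suggest a solution?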

2012-01-13 CTsiddharth

http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene – Mikos 2012-02-16 21:07:04

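There does not seem to be a built-in algorithm. I believe there are three ways to approach this:

a) Run a MoreLikeThis query on one of the documents. Iterate over the results, check the doc IDs, and read off the score. This is perhaps not pretty, and you may need to return a lot of documents before the one you want shows up.

b) Cosine similarity: the answer linked by Mikos in his comment explains how to calculate the cosine similarity of two documents.

c) Compute your own Lucene similarity score. The Lucene score adds a few factors to the cosine similarity (http://lucene.apache.org/core/4_2_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html).

You can use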

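// Classic TF-IDF similarity; stats and arc are obtained as described below
DefaultSimilarity ds = new DefaultSimilarity();
SimScorer scorer = ds.simScorer(stats, arc);
// Scores a single term against the other document
scorer.score(otherDocId, freq);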

You can obtain its arguments with, for example,

// reader is your IndexReader instance; this takes its first segment
AtomicReaderContext arc = reader.leaves().get(0);
SimWeight stats = ds.computeWeight(1f, collectionStats, termStats);
stats.normalize(1f, 1f);

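where, in turn, you can get the term statistics from the TermVector of the first of your two documents, and the collection statistics from your IndexReader. To get the freq argument, use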

DocsEnum docsEnum = MultiFields.getTermDocsEnum(reader, null, field, term);

then iterate through the docs until you find the doc ID of your first document, and do

freq = docsEnum.freq();

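Note that you need to call "scorer.score" for every term in your first document (or for every term you want to consider) and sum up the results.

Finally, multiply by the "queryNorm" and "coord" factors, which you can get with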

//sumWeights was computed while iterating over the first termvector 
//in the main loop by summing up "stats.getValueForNormalization();" 
float queryNorm = ds.queryNorm(sumWeights); 
//thisTV and otherTV are termvectors for the two documents. 
//overlap can be easily calculated 
float coord = ds.coord(overlap, (int) Math.min(thisTV.size(), otherTV.size())); 
return coord * queryNorm * score;
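
The overlap count and the two factors above can also be worked out by hand, which may help when checking the numbers. This is only a sketch in plain Java, assuming the formulas DefaultSimilarity uses as far as I know (coord = overlap / maxOverlap, queryNorm = 1 / sqrt(sumOfSquaredWeights)). The class ClassicFactorsSketch and its method names are made up for illustration, and the term sets stand in for the terms of the two term vectors.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene: mirrors the classic TF-IDF factors.
final class ClassicFactorsSketch {

    // Number of distinct terms that occur in both documents.
    static int overlap(Set<String> thisTerms, Set<String> otherTerms) {
        Set<String> shared = new HashSet<>(thisTerms);
        shared.retainAll(otherTerms);
        return shared.size();
    }

    // Assumed coord formula: fraction of the possible terms that matched.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // Assumed queryNorm formula: 1 / sqrt(sum of squared term weights).
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }
}
```

With these, the last step is the same multiplication as in the snippet above: coord(overlap, maxOverlap) * queryNorm(sumWeights) * score.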

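So this is one way that should work. It is not elegant, and because getting the term frequencies is awkward (you iterate over a DocsEnum for each term), it is not very efficient either. I still hope this helps someone :)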
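
For option b), here is a minimal sketch of plain cosine similarity over raw term frequencies. It deliberately leaves Lucene out: it assumes you have already read each document's term vector into a Map from term to frequency, and the class name CosineSimilaritySketch is hypothetical. It does no IDF weighting, so it will not match Lucene's own scores.

```java
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class CosineSimilaritySketch {

    // Cosine of the angle between two term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty document is treated as similar to nothing
        }
        return dot / (normA * normB);
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }
}
```

Identical term-frequency maps give a value of 1, and documents with no terms in common give 0.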

2015-01-22 01:39:04
