Cosine similarity is widely used for n-gram count or TF-IDF vectors.
from math import pi, acos
def similarity(x, y):
    # x and y are sparse vectors stored as dicts (or Counters): token -> weight
    return sum(x[k] * y[k] for k in x if k in y) / sum(v**2 for v in x.values())**.5 / sum(v**2 for v in y.values())**.5
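According to Wikipedia, cosine similarity can be used to compute a formal distance metric. It obeys all the properties you would expect of a distance (symmetry, non-negativity, and so on):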
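def distance_metric(x, y):
    # Note: this is 1 for identical directions and 0 for non-negative vectors with no shared keys,
    # i.e. 1 minus the angular distance 2 * acos(similarity(x, y)) / pi.
    return 1 - 2 * acos(similarity(x, y)) / pi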
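For non-negative vectors such as n-gram counts or TF-IDF weights, both of these measures range between 0 and 1.
If you have a tokenizer that produces n-grams from a string, you can use these metrics like this: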
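As a rough check on the distance properties, here is a minimal sketch of the angular distance for non-negative vectors, 2 * acos(cosine) / pi, which equals 1 minus the value returned by distance_metric. The names cosine_similarity and angular_distance and the three toy Counters are my own for illustration. The cosine is clamped before calling acos, on the assumption that rounding may push it slightly above 1.

```python
from math import acos, pi
from collections import Counter

def cosine_similarity(x, y):
    """Cosine of the angle between two sparse vectors (dicts: token -> weight)."""
    dot = sum(x[k] * y[k] for k in x if k in y)
    norm_x = sum(v ** 2 for v in x.values()) ** 0.5
    norm_y = sum(v ** 2 for v in y.values()) ** 0.5
    return dot / (norm_x * norm_y)

def angular_distance(x, y):
    """2 * angle / pi, which lies in [0, 1] when all weights are non-negative."""
    # Clamp to [-1, 1]: acos raises ValueError for arguments outside that range.
    cos = max(-1.0, min(1.0, cosine_similarity(x, y)))
    return 2 * acos(cos) / pi

# Toy vectors (illustrative only).
a = Counter({'again': 2, 'world': 1})
b = Counter({'again': 1, 'once': 1})
c = Counter({'hi': 1})

# Identity (up to rounding), symmetry, and one triangle-inequality instance.
print(angular_distance(a, a) < 1e-6)
print(angular_distance(a, b) == angular_distance(b, a))
print(angular_distance(a, c) <= angular_distance(a, b) + angular_distance(b, c))
```

The self-distance is compared against a small tolerance rather than 0 because the computed cosine of a vector with itself can land a hair below 1.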
>>> import Tokenizer  # not a standard-library module; substitute any n-gram tokenizer with this call signature
>>> tokenizer = Tokenizer(ngrams=2, lower=True, nonwords_set=set(['hello', 'and']))
>>> from collections import Counter
>>> list(tokenizer('Hello World again and again?'))
['world', 'again', 'again', 'world again', 'again again']
>>> Counter(tokenizer('Hello World again and again?'))
Counter({'again': 2, 'world': 1, 'again again': 1, 'world again': 1})
>>> x = _
>>> Counter(tokenizer('Hi world once again.'))
Counter({'again': 1, 'world once': 1, 'hi': 1, 'once again': 1, 'world': 1, 'hi world': 1, 'once': 1})
>>> y = _
>>> sum(x[k]*y[k] for k in x if k in y)/sum(v**2 for v in x.values())**.5/sum(v**2 for v in y.values())**.5
0.42857142857142855
>>> distance_metric(x, y)
0.28196592805724774
I found the elegant inner product of Counters in this SO answer.
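The Tokenizer in the session is not defined here, so to reproduce the example you need a stand-in. Below is a minimal sketch of one; the class name NgramTokenizer and its behaviour (lowercase, split on non-word characters, drop the words in nonwords_set, then emit all 1-grams followed by 2-grams and so on) are my assumptions, chosen so that it yields the same tokens as shown in the session.

```python
import re
from collections import Counter

class NgramTokenizer:
    """Hypothetical stand-in for the Tokenizer in the session; not a real library class."""

    def __init__(self, ngrams=1, lower=True, nonwords_set=frozenset()):
        self.ngrams = ngrams
        self.lower = lower
        self.nonwords_set = set(nonwords_set)

    def __call__(self, text):
        if self.lower:
            text = text.lower()
        # Keep runs of word characters, then drop the excluded words.
        words = [w for w in re.findall(r'\w+', text) if w not in self.nonwords_set]
        # Yield every 1-gram first, then every 2-gram, up to self.ngrams.
        for n in range(1, self.ngrams + 1):
            for i in range(len(words) - n + 1):
                yield ' '.join(words[i:i + n])

tokenizer = NgramTokenizer(ngrams=2, lower=True, nonwords_set={'hello', 'and'})
x = Counter(tokenizer('Hello World again and again?'))
y = Counter(tokenizer('Hi world once again.'))
```

With x and y built this way, the inner-product expression and distance_metric from the session can be applied unchanged.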
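I'd be curious to know whether your problem requires the distance to obey the [triangle inequality](http://en.wikipedia.org/wiki/Triangle_inequality), and if so, which of the solutions you found most satisfactory. – 2012-11-29 17:20:20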