2012-07-18 32 views
2

这个问题是问这里之前模糊分组依据,将类似的词

What is a good strategy to group similar words?

,但没有明确的答案是如何“组”项目给出。基于difflib的解决方案基本上是搜索,对于给定的项目,difflib可以从列表中返回最相似的单词。但是,如何将其用于分组?

我想,以减少

['ape', 'appel', 'apple', 'peach', 'puppy'] 

['ape', 'appel', 'peach', 'puppy'] 

['ape', 'apple', 'peach', 'puppy'] 

一个想法,我试过了,对于每个项目,遍历列表,如果get_close_matches回报不止一个匹配,使用它,如果不保留这个词。这部分工作,但它可以建议苹果appel,然后appel苹果,这些话只会改变地方,什么都不会改变。

我将不胜感激任何指针,图书馆的名称等

注:另外在性能方面,我们有30万项的列表,并get_close_matches似乎有点慢。有没有人知道基于C/++的解决方案?

感谢,

注:进一步的调查显示kmedoid是正确的算法(以及层次聚类),因为kmedoid不需要“中心”,它需要/使用数据点自己为中心(这点是称为medoids,因此名称)。在单词分组的情况下,medoid将是该组/集群的代表性元素。

回答

5

您需要规范化组。在每个组中,选择一个代表该组的代码或代码。然后由他们的代表分组。

一些可能的方式:

  • 选择遇到的第一句话。
  • 选择词典第一个单词。
  • 导出所有单词的模式。
  • 选择一个唯一索引。
  • 使用soundex作为模式。

尽管如此,将单词分组可能会很困难。如果A与B类似,并且B与C类似,则A和C不一定彼此相似。如果B是代表,则A和C都可以包括在组中。但如果A或C是代表,另一个不能包括在内。


由第一替代(第一次遇到的字)进行中:

class Seeder: 
    def __init__(self): 
     self.seeds = set() 
     self.cache = dict() 

    def get_seed(self, word): 
     LIMIT = 2 
     seed = self.cache.get(word,None) 
     if seed is not None: 
      return seed 
     for seed in self.seeds: 
      if self.distance(seed, word) <= LIMIT: 
       self.cache[word] = seed 
       return seed 
     self.seeds.add(word) 
     self.cache[word] = word 
     return word 

    def distance(self, s1, s2): 
     l1 = len(s1) 
     l2 = len(s2) 
     matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)] 
     for zz in xrange(0,l2): 
      for sz in xrange(0,l1): 
       if s1[sz] == s2[zz]: 
        matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz]) 
       else: 
        matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1) 
     return matrix[l2][l1] 

import itertools 

def group_similar(words): 
    seeder = Seeder() 
    words = sorted(words, key=seeder.get_seed) 
    groups = itertools.groupby(words, key=seeder.get_seed) 
    return [list(v) for k,v in groups] 

实施例:

import pprint 

print pprint.pprint(group_similar([ 
    'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have', 
    'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you', 
    'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we', 
    'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all', 
    'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if', 
    'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make', 
    'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take', 
    'people', 'into', 'year', 'your', 'good', 'some', 'could', 
    'them', 'see', 'other', 'than', 'then', 'now', 'look', 
    'only', 'come', 'its', 'over', 'think', 'also', 'back', 
    'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well', 
    'way', 'even', 'new', 'want', 'because', 'any', 'these', 
    'give', 'day', 'most', 'us' 
]), width=120) 

输出:

[['after'], 
['also'], 
['and', 'a', 'in', 'on', 'as', 'at', 'an', 'one', 'all', 'can', 'no', 'want', 'any'], 
['back'], 
['because'], 
['but', 'about', 'get', 'just'], 
['first'], 
['from'], 
['good', 'look'], 
['have', 'make', 'give'], 
['his', 'her', 'if', 'him', 'its', 'how', 'us'], 
['into'], 
['know', 'new'], 
['like', 'time', 'take'], 
['most'], 
['of', 'I', 'it', 'for', 'not', 'he', 'you', 'do', 'by', 'we', 'or', 'my', 'so', 'up', 'out', 'go', 'me', 'now'], 
['only'], 
['over', 'our', 'even'], 
['people'], 
['say', 'she', 'way', 'day'], 
['some', 'see', 'come'], 
['the', 'be', 'to', 'that', 'this', 'they', 'there', 'their', 'them', 'other', 'then', 'use', 'two', 'these'], 
['think'], 
['well'], 
['what', 'who', 'when', 'than'], 
['with', 'will', 'which'], 
['work'], 
['would', 'could'], 
['year', 'your']] 
+0

找到组将是困难的部分。我想我可以使用一个聚类算法,作为距离测量的levenshtein距离。在确定集群后,我选择其中一个词(任何一个)作为该集团的代表。 – user423805 2012-07-18 06:47:58

+0

或者,您可以使用两组词语之间的平均成对Levenshtein距离作为它们之间的距离度量(许多分层聚类算法以这种方式工作)。最大成对距离也可能起作用。 – 2012-07-18 06:57:34

+0

团体的代表的可爱的说明。但是,我会建议(双)metaoun over soundex。 +1 – 2012-07-18 07:11:53

3

你必须决定封闭的匹配词,你想使用哪些词。可以从get_close_matches返回的列表中获取第一个元素,或者仅使用该列表上的随机函数,并从关闭的匹配中获取一个元素。

必须有某种规则,因为它..

In [19]: import difflib 

In [20]: a = ['ape', 'appel', 'apple', 'peach', 'puppy'] 

In [21]: a = ['appel', 'apple', 'peach', 'puppy'] 

In [22]: b = difflib.get_close_matches('ape',a) 

In [23]: b 
Out[23]: ['apple', 'appel'] 

In [24]: import random 

In [25]: c = random.choice(b) 

In [26]: c 
Out[26]: 'apple' 

In [27]: 

现在,从最初的名单中删除C,这就是它... 对于C++,你可以使用Levenshtein_distance

0

这里基于medoids的方法。首先安装MlPy。在Ubuntu

sudo apt-get install python-mlpy 

然后

import numpy as np 
import mlpy 

class distance:  
    def compute(self, s1, s2): 
     l1 = len(s1) 
     l2 = len(s2) 
     matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)] 
     for zz in xrange(0,l2): 
      for sz in xrange(0,l1): 
       if s1[sz] == s2[zz]: 
        matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz]) 
       else: 
        matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1) 
     return matrix[l2][l1] 

x = np.array(['ape', 'appel', 'apple', 'peach', 'puppy']) 

km = mlpy.Kmedoids(k=3, dist=distance()) 
medoids,clusters,a,b = km.compute(x) 

print medoids 
print clusters 
print a 

print x[medoids] 
for i,c in enumerate(x[medoids]): 
    print "medoid", c 
    print x[clusters[a==i]] 

输出是

[4 3 1] 
[0 2] 
[2 2] 
['puppy' 'peach' 'appel'] 
medoid puppy 
[] 
medoid peach 
[] 
medoid appel 
['ape' 'apple'] 

越大单词列表和使用K = 10

medoid he 
['or' 'his' 'my' 'have' 'if' 'year' 'of' 'who' 'us' 'use' 'people' 'see' 
'make' 'be' 'up' 'we' 'the' 'one' 'her' 'by' 'it' 'him' 'she' 'me' 'over' 
'after' 'get' 'what' 'I'] 
medoid out 
['just' 'only' 'your' 'you' 'could' 'our' 'most' 'first' 'would' 'but' 
'about'] 
medoid to 
['from' 'go' 'its' 'do' 'into' 'so' 'for' 'also' 'no' 'two'] 
medoid now 
['new' 'how' 'know' 'not'] 
medoid time 
['like' 'take' 'come' 'some' 'give'] 
medoid because 
[] 
medoid an 
['want' 'on' 'in' 'back' 'say' 'and' 'a' 'all' 'can' 'as' 'way' 'at' 'day' 
'any'] 
medoid look 
['work' 'good'] 
medoid will 
['with' 'well' 'which'] 
medoid then 
['think' 'that' 'these' 'even' 'their' 'when' 'other' 'this' 'they' 'there' 
'than' 'them'] 
+0

有趣的是我对这个答案没有评论或投票。嗯.. – user423805 2012-07-20 06:02:26

1

的另一种方法,可以利用矩阵因式分解,使用SVD。首先我们创建词距矩阵,对于100个词,这将是100×100矩阵,表示从每个词到所有其他词的距离。然后,SVD运行在这个矩阵上,结果u,s,v中的u可以看作是每个簇的成员强度。

代码

import numpy as np 
import scipy.linalg as lin 
import Levenshtein as leven 
import matplotlib.pyplot as plt 
from sklearn.cluster import KMeans 
import itertools 

words = np.array(
    ['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have', 
    'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you', 
    'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we', 
    'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all', 
    'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if', 
    'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make', 
    'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take', 
    'people', 'into', 'year', 'your', 'good', 'some', 'could', 
    'them', 'see', 'other', 'than', 'then', 'now', 'look', 
    'only', 'come', 'its', 'over', 'think', 'also', 'back', 
    'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well', 
    'way', 'even', 'new', 'want', 'because', 'any', 'these', 
    'give', 'day', 'most', 'us']) 

print "calculating distances..." 

(dim,) = words.shape 

f = lambda (x,y): leven.distance(x,y) 
res=np.fromiter(itertools.imap(f, itertools.product(words, words)), 
       dtype=np.uint8) 
A = np.reshape(res,(dim,dim)) 

print "svd..." 

u,s,v = lin.svd(A, full_matrices=False) 

print u.shape 
print s.shape 
print s 
print v.shape 

data = u[:,0:10] 
k=KMeans(init='k-means++', k=25, n_init=10) 
k.fit(data) 
centroids = k.cluster_centers_ 
labels = k.labels_ 
print labels 

for i in range(np.max(labels)): 
    print words[labels==i] 

def dist(x,y): 
    return np.sqrt(np.sum((x-y)**2, axis=1)) 

print "centroid points.." 
for i,c in enumerate(centroids): 
    idx = np.argmin(dist(c,data[labels==i])) 
    print words[labels==i][idx] 
    print words[labels==i] 

plt.plot(centroids[:,0],centroids[:,1],'x') 
plt.hold(True) 
plt.plot(u[:,0], u[:,1], '.') 
plt.show() 

from mpl_toolkits.mplot3d import Axes3D 
fig = plt.figure() 
ax = Axes3D(fig) 
ax.plot(u[:,0], u[:,1], u[:,2],'.', zs=0, 
     zdir='z', label='zs=0, zdir=z') 
plt.show() 

结果

any 
['and' 'an' 'can' 'any'] 
do 
['to' 'you' 'do' 'so' 'go' 'no' 'two' 'how'] 
when 
['who' 'when' 'well'] 
my 
['be' 'I' 'by' 'we' 'my' 'up' 'me' 'use'] 
your 
['for' 'or' 'out' 'about' 'your' 'our'] 
its 
['it' 'his' 'if' 'him' 'its'] 
could 
['would' 'people' 'could'] 
this 
['this' 'think' 'these'] 
she 
['the' 'he' 'she' 'see'] 
back 
['all' 'back' 'want'] 
one 
['of' 'on' 'one' 'only' 'even' 'new'] 
just 
['but' 'just' 'first' 'most'] 
come 
['some' 'come'] 
that 
['that' 'than'] 
way 
['say' 'what' 'way' 'day'] 
like 
['like' 'time' 'give'] 
in 
['in' 'into'] 
get 
['her' 'get' 'year'] 
because 
['because'] 
will 
['with' 'will' 'which'] 
over 
['other' 'over' 'after'] 
as 
['a' 'as' 'at' 'also' 'us'] 
them 
['they' 'there' 'their' 'them' 'then'] 
good 
['not' 'from' 'know' 'good' 'now' 'look' 'work'] 
have 
['have' 'make' 'take'] 

k的对于群集的数量的选择是重要的,K = 25给出大于k = 20例如更好的结果。

该代码还通过选取u [...]坐标最接近聚类质心的单词为每个聚类选择一个代表性单词。

3

这是另一个使用Affinity Propagation算法的版本。

import numpy as np 
import scipy.linalg as lin 
import Levenshtein as leven 
import matplotlib.pyplot as plt 
from sklearn.cluster import KMeans 
from sklearn.cluster import AffinityPropagation 
import itertools 

words = np.array(
    ['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have', 
    'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you', 
    'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we', 
    'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all', 
    'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if', 
    'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make', 
    'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take', 
    'people', 'into', 'year', 'your', 'good', 'some', 'could', 
    'them', 'see', 'other', 'than', 'then', 'now', 'look', 
    'only', 'come', 'its', 'over', 'think', 'also', 'back', 
    'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well', 
    'way', 'even', 'new', 'want', 'because', 'any', 'these', 
    'give', 'day', 'most', 'us']) 

print "calculating distances..." 

(dim,) = words.shape 

f = lambda (x,y): -leven.distance(x,y) 

res=np.fromiter(itertools.imap(f, itertools.product(words, words)), dtype=np.uint8) 
A = np.reshape(res,(dim,dim)) 

af = AffinityPropagation().fit(A) 
cluster_centers_indices = af.cluster_centers_indices_ 
labels = af.labels_ 

unique_labels = set(labels) 
for i in unique_labels: 
    print words[labels==i] 

距离必须转换为相似之处,我通过采用距离的负值做到了这一点。输出是

['to' 'you' 'do' 'by' 'so' 'who' 'go' 'into' 'also' 'two'] 
['it' 'with' 'at' 'if' 'get' 'its' 'first'] 
['of' 'for' 'from' 'or' 'your' 'look' 'after' 'work'] 
['the' 'be' 'have' 'I' 'he' 'we' 'her' 'she' 'me' 'give'] 
['this' 'his' 'which' 'him'] 
['and' 'a' 'in' 'an' 'my' 'all' 'can' 'any'] 
['on' 'one' 'good' 'some' 'see' 'only' 'come' 'over'] 
['would' 'could'] 
['but' 'out' 'about' 'our' 'most'] 
['make' 'like' 'time' 'take' 'back'] 
['that' 'they' 'there' 'their' 'when' 'them' 'other' 'than' 'then' 'think' 
'even' 'these'] 
['not' 'no' 'know' 'now' 'how' 'new'] 
['will' 'people' 'year' 'well'] 
['say' 'what' 'way' 'want' 'day'] 
['because'] 
['as' 'up' 'just' 'use' 'us'] 
+0

首先让我们说感谢你为我节省了很长时间的研究!我在你的代码中增加了一个改进,而不是在Affinitty传播算法中使用默认的欧几里德距离,我将它改为:预先计算,所以我改变了这个af = AffinityPropagation中的行(affinity ='precomputed')。fit(A)和得到比默认更好的结果。 – 2017-06-23 08:40:58