Python: Find unknown repeating words in a list of strings

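I have a list of strings, which are subjects from different email conversations. I would like to see whether there are words or word combinations that are used frequently.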

An example list would be:

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal'
]

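The function would have to detect that the combination "Company Name" is used more than once, and that "Proposal" is used more than once. These words are not known in advance though, so I guess it would have to start by trying all possible combinations.

The actual list is of course a lot longer than this example, so manually trying all combinations doesn't seem like the best approach. What is the best way to go about this?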

UPDATE

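I started developing this function using Tim Pietzcker's answer, but I got stuck on applying the Counter correctly. It keeps returning the length of the list as the count for every phrase.

The phrases function, including a punctuation filter, a check whether the phrase has already been seen, and a maximum length of 3 words per phrase: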

def phrases(string, phrase_list):
    words = string.split()
    result = []
    punctuation = '\'\"-_,.:;!? '
    for number in range(len(words)):
        for start in range(len(words)-number):
            if number+1 <= 3:
                phrase = " ".join(words[start:start+number+1])
                if phrase in phrase_list:
                    pass
                else:
                    phrase_list.append(phrase)
                    phrase = phrase.strip(punctuation).lower()
                    if phrase:
                        result.append(phrase)
    return result, phrase_list

Then a loop through the list of subjects:

phrase_list = [] 
ranking = {} 
for s in subjects: 
    result, phrase_list = phrases(s, phrase_list) 
    all_phrases = collections.Counter(phrase.lower() for s in subjects for phrase in result)

"all_phrases" returns a list of tuples in which every count is 167, which is the length of the subject list I'm using. Not sure what I'm missing here...


2016-03-03 Vincent

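This is not a duplicate. At least not of that particular question. This is not about items in a list, but about common phrases in a list of strings. Please read the title before closing. –

The suggested duplicate question does not answer my question in any way... – Vincent

Just reopened it. –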

You also want to find phrases that consist of more than a single word. No problem. This should even scale quite well.

import collections

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal',
    'Some more Firm / Company Names'
]

def phrases(string):
    """Return every run of consecutive words in string, shortest first."""
    words = string.split()
    result = []
    # number + 1 is the phrase length in words; start is its first word
    for number in range(len(words)):
        for start in range(len(words)-number):
            result.append(" ".join(words[start:start+number+1]))
    return result

phrases() splits the input string on whitespace and returns all possible sub-phrases (runs of consecutive words) of any length:

In [2]: phrases("A Day in the Life") 
Out[2]: 
['A', 
'Day', 
'in', 
'the', 
'Life', 
'A Day', 
'Day in', 
'in the', 
'the Life', 
'A Day in', 
'Day in the', 
'in the Life', 
'A Day in the', 
'Day in the Life', 
'A Day in the Life']
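
As a rough size check (a sketch, not part of the original answer): a subject of n words produces n(n+1)/2 phrases, which is why the five-word title above yields 15 entries. The helper name phrase_count is made up for this illustration, and phrases() is restated in a compact form so the snippet runs on its own.

```python
def phrases(string):
    # Same sub-phrases as above: lengths ascending, start positions ascending.
    words = string.split()
    return [" ".join(words[start:start + length])
            for length in range(1, len(words) + 1)
            for start in range(len(words) - length + 1)]

def phrase_count(n_words):
    # Hypothetical helper: n + (n-1) + ... + 1 = n(n+1)/2 sub-phrases.
    return n_words * (n_words + 1) // 2

title = "A Day in the Life"
print(len(phrases(title)), phrase_count(len(title.split())))
```

Email subjects are short, so the quadratic growth per subject stays small; the total work grows only linearly with the number of subjects.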

Now you can count how many times each of these phrases occurs across all your subjects:

all_phrases = collections.Counter(phrase for subject in subjects for phrase in phrases(subject))

Result:

In [3]: print([(phrase, count) for phrase, count in all_phrases.items() if count > 1])
Out[3]:
[('Company', 4), ('Proposal', 2), ('Firm', 2), ('Name', 3), ('Company Name', 3),
('Firm /', 2), ('/', 2), ('/ Company', 2), ('Firm / Company', 2)]

Note that you might want to use criteria other than simply splitting on whitespace, for example ignoring punctuation and case.
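
One way that note could be applied is sketched below. It is only an illustration under stated assumptions: a "word" is taken to be a run of letters, digits or apostrophes, everything else (spaces, "-", "/") is treated as a separator, and phrases are capped at three words. The names normalized_phrases, count_phrases and max_words are invented for this sketch.

```python
import collections
import re

def normalized_phrases(subject, max_words=3):
    """Yield every run of 1..max_words consecutive words, lowercased."""
    # Assumption: words are runs of letters, digits or apostrophes;
    # punctuation such as '-' and '/' only separates words.
    words = re.findall(r"[a-z0-9']+", subject.lower())
    for length in range(1, max_words + 1):
        for start in range(len(words) - length + 1):
            yield " ".join(words[start:start + length])

def count_phrases(subjects, max_words=3):
    """Build a single Counter over all subjects in one pass."""
    return collections.Counter(
        phrase
        for subject in subjects
        for phrase in normalized_phrases(subject, max_words)
    )

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal',
]

counts = count_phrases(subjects)
repeated = {phrase: n for phrase, n in counts.items() if n > 1}
print(repeated)
```

For the four sample subjects this prints {'proposal': 2, 'company': 3, 'name': 3, 'company name': 3}. The Counter is built once from a generator that visits each subject exactly once; rebuilding it inside a loop with an extra "for s in subjects" clause would count every phrase once per subject, which would explain counts equal to the length of the list.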


2016-03-04 07:20:15

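Thanks, this is a great start. I have implemented this in a loop, but I'm having some trouble with the Counter. I have updated the question with the latest state. – Vincent

I would suggest using whitespace as the separator; otherwise there are too many possibilities, unless you specify what an allowed "phrase" should look like.

To count the occurrences of words, you can use Counter from the collections module: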

import operator 
from collections import Counter 

d = Counter(' '.join(subjects).split()) 

# create a list of tuples, ordered by occurrence frequency 
sorted_d = sorted(d.items(), key=operator.itemgetter(1), reverse=True) 

# print all entries that occur more than once 
for x in sorted_d:
    if x[1] > 1:
        print(x[1], x[0])

Output:

3 Name 
3 Company 
2 Proposal
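
As a possible shortcut (a sketch, not part of the original answer), Counter.most_common() already returns the (word, count) pairs sorted by descending count, so the explicit sorted() call with operator.itemgetter can be dropped. The sample list is repeated here so the snippet runs on its own; the variable names are illustrative.

```python
from collections import Counter

subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal',
]

# Count every whitespace-separated token across all subjects.
word_counts = Counter(word for subject in subjects for word in subject.split())

# most_common() sorts by count, highest first; ties keep first-seen order.
repeated = [(word, count) for word, count in word_counts.most_common() if count > 1]
for word, count in repeated:
    print(count, word)
```

Words with equal counts come out in the order they were first encountered, so the relative order of "Company" and "Name" can differ from the output shown above.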


2016-03-03 15:20:40

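Thanks, this is very helpful. By first getting the repeated words, I could then start looking for word combinations using the words this function finds. I'll play around with it a bit and post my results here. – Vincent

As a possible alternative to tokenizing sentences with 'split()', you could also use the 'word_tokenize()' function from 'nltk'. http://www.nltk.org/book/ch03.html –

Similar to PP_'s answer. Using split.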

import operator 

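subjects = [
    'Proposal to cooperate - Company Name',
    'Company Name Introduction',
    'Into Other Firm / Company Name',
    'Request for Proposal'
]
# flatten the subjects into one list of whitespace-separated words
flat_list = [item for i in subjects for item in i.split()]
# map each word to the number of times it occurs in the flat list
count_dict = {i: flat_list.count(i) for i in flat_list}
# list of (word, count) tuples, most frequent first
sorted_dict = sorted(count_dict.items(), reverse=True, key=operator.itemgetter(1))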

Output:

[('Name', 3), 
('Company', 3), 
('Proposal', 2), 
('Other', 1), 
('/', 1), 
('for', 1), 
('cooperate', 1), 
('Request', 1), 
('Introduction', 1), 
('Into', 1), 
('-', 1), 
('to', 1), 
('Firm', 1)]


2016-03-03 15:42:18 Faller
