2015-10-13 60 views
0

Comparing elements of one list with sub-elements of another

I have a list of sentences, listOfSentences, that looks like this:

listOfSentences = ['mary had a little lamb.', 
        'she also had a little pram.', 
        'bam bam bam she also loves ham.', 
        'she ate the lamb.'] 

I also have a keyWords dictionary (written here as a set of (word, frequency) tuples) that looks like this:

keyWords= {('bam', 3), ('lamb', 2), ('ate', 1)} 

where the higher a word's frequency, the smaller its key in keyWords.

The output I want is:

>>> print(keySentences) 
['bam bam bam she also loves ham.', 'she ate the lamb.'] 

My question is: how can I compare the elements of keyWords with the elements of listOfSentences so that I can output the list keySentences?

Answers

1

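keyWords would be more useful as a dictionary, because getting the score of each word then becomes a simple dictionary lookup. Each word can be extracted from a sentence with split().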

Here is some code to do that. It assumes that punctuation is part of a word (as your example result list keySentences implies):

listOfSentences = ['mary had a little lamb.', 
        'she also had a little pram.', 
        'bam bam bam she also loves ham.', 
        'she ate the lamb.'] 

keyWords= [('bam', 3), ('lamb', 2), ('ate', 1)] 
keyWords = dict(keyWords) 

keySentences = [] 
for sentence in listOfSentences: 
    score = sum(keyWords.get(word, 0) for word in sentence.split()) 
    if score > 0: 
        keySentences.append((score, sentence)) 

keySentences = [sentence for score, sentence in sorted(keySentences, reverse=True)] 
print(keySentences) 

Output

 
['bam bam bam she also loves ham.', 'she ate the lamb.'] 
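
To see why only two sentences make it into the result, it can help to look at the score of every sentence. The following is a minimal sketch rather than part of the answer above; it starts from the set of tuples shown in the question and assumes the numbers are per-word scores. The name scores is introduced here purely for illustration.

```python
# keyWords as written in the question: a set of (word, score) tuples
keyWords = {('bam', 3), ('lamb', 2), ('ate', 1)}
keyWords = dict(keyWords)  # now a real dictionary: word -> score

listOfSentences = ['mary had a little lamb.',
                   'she also had a little pram.',
                   'bam bam bam she also loves ham.',
                   'she ate the lamb.']

# 'scores' is a hypothetical helper: sentence -> summed keyword score
scores = {}
for sentence in listOfSentences:
    # split() keeps punctuation attached, so 'lamb.' is not the key 'lamb'
    scores[sentence] = sum(keyWords.get(word, 0) for word in sentence.split())

for sentence in listOfSentences:
    print(scores[sentence], sentence)
```

With this data, 'mary had a little lamb.' scores 0 because split() yields 'lamb.' (with the full stop), which is not a key in keyWords, so the sentence is dropped by the score > 0 test.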

If you want to ignore punctuation, you can remove it from each sentence before processing:

import string 

# mapping to remove punctuation with str.translate() 
remove_punctuation = {ord(c): None for c in string.punctuation} 

listOfSentences = ['mary had a little lamb.', 
        'she also had a little pram.', 
        'bam bam bam she also loves ham.', 
        'she ate the lamb.'] 

keyWords= [('bam', 3), ('lamb', 2), ('ate', 1)] 
keyWords = dict(keyWords) 

keySentences = [] 
for sentence in listOfSentences: 
    score = sum(keyWords.get(word, 0) for word in sentence.translate(remove_punctuation).split()) 
    if score > 0: 
        keySentences.append((score, sentence)) 

keySentences = [sentence for score, sentence in sorted(keySentences, reverse=True)] 
print(keySentences) 

Output

 
['bam bam bam she also loves ham.', 'she ate the lamb.', 'mary had a little lamb.'] 

The result list now also includes 'mary had a little lamb.' because the full stop trailing 'lamb' is removed by str.translate().
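
The effect of the translation table is easier to see on a single sentence. This is a small sketch assuming Python 3, where str.translate() accepts a dictionary that maps code points to None in order to delete them. The second half shows a possible alternative that is not from the answer: stripping punctuation only from the ends of each word, which keeps an apostrophe inside a word. The names cleaned, text and stripped are illustrative.

```python
import string

# same mapping as in the answer: every punctuation code point is deleted
remove_punctuation = {ord(c): None for c in string.punctuation}

sentence = 'mary had a little lamb.'
cleaned = sentence.translate(remove_punctuation)
print(cleaned.split())   # ['mary', 'had', 'a', 'little', 'lamb']

# Alternative (an assumption, not the answer's method): strip punctuation
# from the ends of each word only, so "mary's" keeps its apostrophe
text = "mary's lamb."
stripped = [word.strip(string.punctuation) for word in text.split()]
print(stripped)          # ["mary's", 'lamb']
```

Which variant fits better depends on whether the keywords themselves may contain punctuation.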

0

Try something like this:

>>> [x for x in listOfSentences for i in keyWords if x.count(i[0])==i[1]] 
['bam bam bam she also loves ham.', 'she ate the lamb.'] 
+0

This would also match 'ate' inside 'late'. – The6thSense

+0

The OP only mentioned words; maybe he needs an exact match. – Hackaholic

+0

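That is what I am saying: you are doing partial matching here. How did you arrive at this logic? I don't understand what the OP is asking for. – The6thSense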
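
The point raised in these comments can be sketched as follows. str.count() counts substrings, so 'ate' is found inside 'late'. Counting whole words avoids that. The helper matches_exactly and the extra sentence 'she came late.' are hypothetical and only serve the illustration; the helper strips a trailing full stop only, which is enough for the sample data.

```python
from collections import Counter

keyWords = [('bam', 3), ('lamb', 2), ('ate', 1)]

listOfSentences = ['mary had a little lamb.',
                   'she also had a little pram.',
                   'bam bam bam she also loves ham.',
                   'she ate the lamb.',
                   'she came late.']  # extra sentence, added for illustration

# substring counting: 'ate' is found inside 'late'
print('she came late.'.count('ate'))  # 1

def matches_exactly(sentence, key_words):
    """True if some keyword occurs as a whole word exactly `count` times."""
    # hypothetical helper; only a trailing '.' is stripped from each word
    word_counts = Counter(word.rstrip('.') for word in sentence.split())
    return any(word_counts[word] == count for word, count in key_words)

keySentences = [s for s in listOfSentences if matches_exactly(s, keyWords)]
print(keySentences)
```

Using any() also means a sentence is listed once even when several keywords match it, whereas the nested comprehension above emits the sentence once per matching keyword.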

1

The following scores your sentences by the number of matching words:

import re 

keyWords = [('bam', 3), ('lamb', 2), ('ate', 1)] 
keyWords = [w for w, c in keyWords]  # only need the words 

listOfSentences = [ 
    'mary had a little lamb.', 
    'she also had a little pram.', 
    'bam bam bam she also loves ham.', 
    'she ate the lamb.']  

words = [re.findall(r'(\w+)', s) for s in listOfSentences] 
keySentences = [] 

for word_list, sentence in zip(words, listOfSentences): 
    keySentences.append((len([word for word in word_list if word in keyWords]), sentence)) 

for count, sentence in sorted(keySentences, reverse=True): 
    print('{:2} {}'.format(count, sentence)) 

This gives you the following output:

 3 bam bam bam she also loves ham. 
 2 she ate the lamb. 
 1 mary had a little lamb. 
 0 she also had a little pram. 
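
The answers in this thread score sentences in two different ways: by the number of matching words (as just shown) or by the sum of the keyword values (as in the first answer). A rough sketch that puts both side by side, assuming the numbers in keyWords can be used as weights; the function rank and its score_word parameter are hypothetical names introduced here.

```python
import re

keyWords = dict([('bam', 3), ('lamb', 2), ('ate', 1)])

listOfSentences = ['mary had a little lamb.',
                   'she also had a little pram.',
                   'bam bam bam she also loves ham.',
                   'she ate the lamb.']

def rank(sentences, score_word):
    """Return (score, sentence) pairs, highest score first."""
    scored = []
    for sentence in sentences:
        words = re.findall(r'\w+', sentence)  # words without punctuation
        scored.append((sum(score_word(w) for w in words), sentence))
    return sorted(scored, reverse=True)

# one point per matching word
by_count = rank(listOfSentences, lambda w: 1 if w in keyWords else 0)
# keyword value used as a weight
by_weight = rank(listOfSentences, lambda w: keyWords.get(w, 0))

for score, sentence in by_weight:
    print('{:2} {}'.format(score, sentence))
```

For this sample data both schemes produce the same ordering; they can diverge when a low-valued keyword is repeated many times.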