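Python: best/most efficient way to find a list of words in a text?

I have a list of about 300 words and a huge amount of text that I want to scan to find out how many times each word appears.

I am using Python's re module: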

for word in list_word: 
    search = re.compile(r"""(\s|,)(%s).?(\s|,|\.|\))""" % word) 
    occurrences = search.subn("", text)[1]

but I was wondering whether there is a more efficient or more elegant way of doing this?


2010-07-30 Mermoz

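You can use word boundaries instead of checking for surrounding whitespace and punctuation: '\bWORD\b' – mpen 2010-07-30 14:20:51

If you want to go beyond word frequencies and look into text classification, you might want to check out the following: http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/ – monkut 2010-07-30 14:30:49

How **huge** can the text be if you are holding it in memory? – FMc 2010-07-30 17:16:13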
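
To make the word-boundary suggestion concrete, here is a minimal sketch that compiles all the words into a single alternation wrapped in \b, so the text is scanned once rather than once per word. The function name count_words_regex and the sample sentence are made up for illustration. It assumes every listed word starts and ends with a letter, digit or underscore, since \b only works at such edges, and it matches case-sensitively.

```python
import re
from collections import Counter

def count_words_regex(words, text):
    """Count whole-word occurrences of each listed word with one compiled pattern."""
    # re.escape keeps any regex metacharacters in the words literal.
    pattern = re.compile(r"\b(?:%s)\b" % "|".join(re.escape(w) for w in words))
    counts = Counter(m.group(0) for m in pattern.finditer(text))
    # Report every listed word, including those that never matched.
    return {w: counts[w] for w in words}

sample = "This and that, or that (this) but not thistle."
print(count_words_regex(["this", "that"], sample))
```

"This" is skipped because of its capital letter and "thistle" because of the trailing \b; passing re.IGNORECASE and lowercasing the matches would be one way to fold case.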

If you have a large amount of text, I would not use regular expressions in this case but simply split the text:

words = {"this": 0, "that": 0}  # one entry per word of interest, starting at zero
for w in text.split():
    if w in words:  # skip anything that is not in the word list
        words[w] += 1

words will then hold the frequency of each word.
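
The same single-pass idea can also be sketched with collections.Counter, which saves initialising the dictionary by hand. The function name count_listed_words is hypothetical, and the splitting is still plain whitespace splitting, so a token like "that," will not count as "that".

```python
from collections import Counter

def count_listed_words(words, text):
    """Count, in one pass over the text, how often each listed word occurs."""
    wanted = set(words)  # set membership tests are O(1) on average
    counts = Counter(w for w in text.split() if w in wanted)
    # Counter returns 0 for missing keys, so unseen words are reported as 0.
    return {w: counts[w] for w in words}

print(count_listed_words(["this", "that"], "this is that and this"))
```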


2010-07-30 14:25:40

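Definitely more efficient, since it scans the text only once. The snippet above seems to be missing a check that the word is one of the 300 "important" words. – pdbartlett 2010-07-30 14:28:12

@pdbartlett 'if w in words' does that check. – Wilduck 2010-07-30 14:41:42

Splitting on whitespace will not always give perfect results. If you need more sophisticated splitting, have a look at NLTK, suggested below. – 2010-07-30 20:40:46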
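
As an illustration of that caveat, here is a rough sketch comparing plain whitespace splitting with a simple regex tokenizer. The tokenize helper and the sample text are made up for this example, and \w+ is only one possible definition of a "word": it splits "That's" into "that" and "s", which may or may not be what you want.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return runs of letters, digits and underscores."""
    return re.findall(r"\w+", text.lower())

text = "This, that and (this) again. That's all."

# Whitespace splitting leaves punctuation glued to the words.
print(text.split())

# The regex tokenizer drops the punctuation before counting.
counts = Counter(tokenize(text))
print(counts["this"], counts["that"])
```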

Googling "python frequency" gave me this page as the first result: http://www.daniweb.com/code/snippet216747.html

This seems to be what you are looking for.


2010-07-30 14:22:24

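It is un-Pythonic with all those regular expressions. Splitting into individual words is best done with str.split() rather than a custom regex. – 2010-07-30 14:36:52

You are right, if Python's string functions are sufficient, they should be used instead of regular expressions. – 2010-07-30 16:36:51

You could also split the text into words and search the resulting list.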


2010-07-30 14:23:04

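Regular expressions may not be what you want. Python has a number of built-in string operations that are much faster, and I believe .count() does what you need.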

http://docs.python.org/library/stdtypes.html#string-methods
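
One thing to keep in mind with .count(): on a string it counts non-overlapping substring matches, not whole words. A small sketch with a made-up sample sentence shows the difference from counting in a list of words:

```python
text = "the other theme is there"

# str.count matches substrings, so "other", "theme" and "there" also count.
substring_hits = text.count("the")

# list.count on the split text only matches the exact word.
whole_word_hits = text.split().count("the")

print(substring_hits, whole_word_hits)  # 4 1
```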


2010-07-30 14:24:01 chimeracoder

Try removing all punctuation from the text and then splitting on whitespace. After that, simply do

for word in list_word: 
    occurence = strippedText.count(word)

Or, if you are using Python 3.0, I think you can do this:

occurences = {word: strippedText.count(word) for word in list_word}
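
For completeness, here is one possible way to get the stripped text in Python 3. The helper strip_punctuation and the sample text are assumptions for illustration, not part of the answer. Note that deleting punctuation outright turns "don't" into "dont" and joins hyphenated words, and that calling .count() once per word rescans the list each time.

```python
import string

def strip_punctuation(text):
    """Delete every ASCII punctuation character from the text."""
    return text.translate(str.maketrans("", "", string.punctuation))

text = "Spam, eggs and spam! (More spam.)"
list_word = ["spam", "eggs"]

# Lowercase, strip punctuation, then split on whitespace into a list of words.
stripped_words = strip_punctuation(text).lower().split()

# list.count compares whole items, so only exact words are counted.
occurrences = {word: stripped_words.count(word) for word in list_word}
print(occurrences)  # {'spam': 3, 'eggs': 1}
```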


2010-07-30 14:27:18 jacobangel

In 2.6 <= python < 3.0 you can do 'occurences = dict((word, strippedText.count(word)) for word in list_word)' – Wilduck 2010-07-30 14:44:55

If Python is not a must, you can use awk

$ cat file 
word1 
word2 
word3 
word4 

$ cat file1 
blah1 blah2 word1 word4 blah3 word2 
junk1 junk2 word2 word1 junk3 
blah4 blah5 word3 word6 end 

$ awk 'FNR==NR{w[$1];next} {for(i=1;i<=NF;i++) a[$i]++}END{for(i in w){ if(i in a) print i,a[i] } } ' file file1 
word1 2 
word2 2 
word3 1 
word4 1
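
If you would rather stay in Python, the awk one-liner above translates roughly as follows. This is only a sketch: the two files are replaced by in-memory lists of lines holding the same sample data, the function name count_from_lines is made up, and the results are sorted for a stable order (awk's "for (i in w)" does not guarantee one).

```python
from collections import Counter

word_lines = ["word1", "word2", "word3", "word4"]
text_lines = [
    "blah1 blah2 word1 word4 blah3 word2",
    "junk1 junk2 word2 word1 junk3",
    "blah4 blah5 word3 word6 end",
]

def count_from_lines(word_lines, text_lines):
    """Mimic the awk script: count every field, report only the listed words."""
    # Like w[$1] in awk: take the first field of each non-empty line.
    wanted = {line.split()[0] for line in word_lines if line.strip()}
    # Like a[$i]++ in awk: count every whitespace-separated field.
    counts = Counter(field for line in text_lines for field in line.split())
    # Like the END block: print only listed words that actually occurred.
    return {w: counts[w] for w in sorted(wanted) if w in counts}

for word, n in count_from_lines(word_lines, text_lines).items():
    print(word, n)
```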


2010-07-30 14:41:57 ghostdog74

It sounds like the Natural Language Toolkit might have what you need.

http://www.nltk.org/


2010-07-30 15:20:27 Glenjamin

The 'nltk.FreqDist' class. – 2010-07-30 20:38:44

Maybe you can adapt this multisearch generator function of mine.

from itertools import islice

testline = "Sentence 1. Sentence 2? Sentence 3! Sentence 4. Sentence 5."

def multis(search_sequence, text, start=0):
    """Multisearch by given search sequence values from text, starting from position start,
    yielding tuples of the text before a sequence item and the found sequence item."""
    x = ''
    for ch in text[start:]:
        if ch in search_sequence:
            # A separator with no text before it is yielded alone, not as a tuple.
            if x:
                yield (x, ch)
            else:
                yield ch
            x = ''
        else:
            x += ch
    else:
        # Yield whatever text remains after the last separator.
        if x:
            yield x

# Split off the first two sentences at the dot/question mark/exclamation mark.
two_sentences = list(islice(multis('.?!', testline), 2))  # must save the result of generation
print("result of split: ", two_sentences)

print('\n'.join(sentence.strip() + sep for sentence, sep in two_sentences))


2010-07-30 15:56:07
