This sounds like a job for collections.Counter:
import collections

# Read the whole file and split it on whitespace.
with open('gettysburg.txt') as f:
    c = collections.Counter(f.read().split())

print("'Four' appears %d times" % c['Four'])
print("'the' appears %d times" % c['the'])
print("There are %d total words" % sum(c.values()))
print("The 5 most common words are", c.most_common(5))
Result:
$ python foo.py
'Four' appears 1 times
'the' appears 9 times
There are 267 total words
The 5 most common words are [('that', 10), ('the', 9), ('to', 8), ('we', 8), ('a', 7)]
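Of course, this counts "liberty," and "this." as words (note the punctuation attached to each). It also treats "The" and "the" as different words. And reading the whole file at once with f.read() can use a lot of memory on very large files.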
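To see the punctuation and case problems in isolation, here is a minimal sketch that runs the same split-based count on a short string. The sample text is made up for illustration and is not taken from gettysburg.txt.

```python
import collections

# Hypothetical sample text, used only to illustrate the behavior.
sample = "The people, the people. Liberty, and liberty"

# Same approach as above: split on whitespace, then count.
naive = collections.Counter(sample.split())

# Punctuation stays attached, so these are two separate keys
# and the bare word "people" is never counted.
print(naive["people,"], naive["people."], naive["people"])  # 1 1 0

# Case is preserved, so "The" and "the" are counted separately.
print(naive["The"], naive["the"])  # 1 1
```

A Counter returns 0 for a missing key rather than raising KeyError, which is why looking up "people" is safe here.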
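Here is a version that ignores punctuation and case, and that is more memory-efficient on large files because it reads one line at a time: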
import collections
import re

# Iterate line by line and keep only runs of letters, lowercased.
# [^\W\d_] is "a word character that is not a digit or underscore".
with open('gettysburg.txt') as f:
    c = collections.Counter(
        word.lower()
        for line in f
        for word in re.findall(r'\b[^\W\d_]+\b', line))

print("'Four' appears %d times" % c['Four'])
print("'the' appears %d times" % c['the'])
print("There are %d total words" % sum(c.values()))
print("The 5 most common words are", c.most_common(5))
Result:
$ python foo.py
'Four' appears 0 times
'the' appears 11 times
There are 271 total words
The 5 most common words are [('that', 13), ('the', 11), ('we', 10), ('to', 8), ('here', 8)]
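Note that 'Four' now reports 0 because every key is lowercased before counting, so lookups need lowercase keys too. As a rough sketch of how the pattern behaves, the same logic can be wrapped in a helper and tried on a few in-memory lines. The names count_words and WORD_RE and the sample lines are hypothetical, introduced only for this illustration.

```python
import collections
import re

# Letters only: a word character (\w) that is neither a digit nor "_".
WORD_RE = re.compile(r'\b[^\W\d_]+\b')

def count_words(lines):
    """Count lowercased words from any iterable of text lines."""
    return collections.Counter(
        word.lower()
        for line in lines
        for word in WORD_RE.findall(line))

# Hypothetical sample lines, not taken from gettysburg.txt.
sample_lines = [
    "Four score and seven years ago, in 1863.",
    "The end; the END!",
]
counts = count_words(sample_lines)

print(counts["four"], counts["Four"])  # 1 0  (keys are lowercase)
print(counts["the"], counts["end"])    # 2 2  (case and punctuation ignored)
print(counts["1863"])                  # 0    (digits are not words)
print(sum(counts.values()))            # 11
```

Because count_words accepts any iterable of lines, an open file object can be passed in directly, which keeps the line-at-a-time memory behavior. One limitation of this pattern is that an apostrophe ends a word, so "don't" would be counted as "don" and "t".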
What is your question? – 2014-09-23 01:30:16