Creating a Python dictionary from a text file and retrieving the count of each word

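I'm trying to create a dictionary of words from a text file, count the instances of each word, and be able to look a word up in the dictionary and get its count, but I'm at a standstill. I'm having the most trouble lowercasing the text and removing punctuation; without that, my counts will be off. Any suggestions?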

f=open("C:\Users\Mark\Desktop\jefferson.txt","r")
wc={}
words = f.read().split()
count = 0
i = 0
for line in f: count += len(line.split())
for w in words:
    if i < count:
        words[i].translate(None, string.punctuation).lower()
        i += 1
    else:
        i += 1
        print words
for w in words:
    if w not in wc:
        wc[w] = 1
    else:
        wc[w] += 1
print wc['states']

Source

2014-09-23 Murph

What is your question? – 2014-09-23 01:30:16

A few points:

In Python, always use the following construct to read a file:

with open('ls;df', 'r') as f: 
    # rest of the statements

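If you use f.read().split(), the read consumes the file up to its end, so the later for line in f loop has nothing left to iterate over and count stays 0. To go through the file again, you first need to return to the beginning: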

f.seek(0)
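
To make the file-position point concrete, here is a minimal sketch in Python 3. It uses io.StringIO with made-up sample text as a stand-in for the real file (an assumption, so the example runs without jefferson.txt); a text file returned by open() behaves the same way for read(), iteration, and seek(0).

```python
import io

# StringIO stands in for an opened text file (hypothetical sample content).
f = io.StringIO("We hold these truths\nto be self-evident\n")

words = f.read().split()      # read() consumes the whole stream
lines_after_read = list(f)    # nothing is left to iterate over

f.seek(0)                     # rewind to the beginning
lines_after_seek = list(f)    # the lines are available again

print(len(words), len(lines_after_read), len(lines_after_seek))
```

In the question's code this is why count never grows: the loop over f runs zero times once read() has already reached the end.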

Third, in the part where you do:

for w in words:
    if i < count:
        words[i].translate(None, string.punctuation).lower()
        i += 1
    else:
        i += 1
        print words

You don't need to maintain a counter yourself in Python. You can simply do:

for i, w in enumerate(words):
    if i < count:
        words[i].translate(None, string.punctuation).lower()
    else:
        print words

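However, you don't even need the i < count check here. Since translate() and lower() return new strings instead of changing the list in place, you can simply rebuild the list: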

import string  # needed for string.punctuation

words = [w.translate(None, string.punctuation).lower() for w in words]
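
The two-argument translate(None, ...) call is the Python 2 form. In Python 3, str.translate takes a single translation table, so a rough equivalent (a sketch with a made-up sample sentence, not the answerer's code) builds the table with str.maketrans:

```python
import string

# Translation table that deletes every ASCII punctuation character.
strip_punct = str.maketrans("", "", string.punctuation)

# Hypothetical sample input standing in for f.read().split().
words = "The States, the states; and THE people.".split()
words = [w.translate(strip_punct).lower() for w in words]

print(words)
# ['the', 'states', 'the', 'states', 'and', 'the', 'people']
```

Note that string.punctuation covers ASCII punctuation only, so typographic quotes or dashes in the text would pass through untouched.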

Finally, if you only want to count 'states' rather than build a dictionary of every word, consider using filter:

print len(filter(lambda m: m == 'states', words))
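
The filter line above relies on Python 2, where filter returns a list. In Python 3 it returns a lazy iterator, so len(filter(...)) raises a TypeError. Two alternatives that work in either version, shown on a made-up word list:

```python
# Hypothetical, already-cleaned word list.
words = ["the", "states", "of", "the", "united", "states"]

# list.count scans the list once and returns the number of equal items.
n_states = words.count("states")

# Generator form, usable when words is an iterator rather than a list.
n_states_gen = sum(1 for w in words if w == "states")

print(n_states, n_states_gen)
```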

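One last thing...

If the file is large, it is not advisable to hold every word in memory at once. Consider updating the wc dictionary line by line. Instead of what you did, you could do: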

for line in f: 
    words = line.split() 
    # rest of your code
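
Putting the points above together, one possible shape for the "rest of your code" part is sketched below in Python 3. The helper name count_words and the sample text are assumptions made for illustration; with a real file you would pass the object returned by open() instead of the StringIO stand-in.

```python
import io
import string

# Table that deletes ASCII punctuation (Python 3 form of translate).
strip_punct = str.maketrans("", "", string.punctuation)

def count_words(lines):
    """Build a word -> count dict, reading one line at a time."""
    wc = {}
    for line in lines:
        for w in line.split():
            w = w.translate(strip_punct).lower()
            if w:  # skip tokens that were pure punctuation
                wc[w] = wc.get(w, 0) + 1
    return wc

# Hypothetical sample standing in for: with open(path) as f: ...
sample = io.StringIO("The States, united.\nThese states - the people.\n")
wc = count_words(sample)

print(wc.get("states", 0))
```

Using wc.get(word, 0) for lookups avoids the KeyError that wc['states'] would raise for a word that never appears.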

Source

2014-09-23 01:29:40 ssm

This sounds like a job for collections.Counter:

import collections 

with open('gettysburg.txt') as f: 
    c = collections.Counter(f.read().split()) 

print "'Four' appears %d times"%c['Four'] 
print "'the' appears %d times"%c['the'] 
print "There are %d total words"%sum(c.values()) 
print "The 5 most common words are", c.most_common(5)

Result:

$ python foo.py 
'Four' appears 1 times 
'the' appears 9 times 
There are 267 total words 
The 5 most common words are [('that', 10), ('the', 9), ('to', 8), ('we', 8), ('a', 7)]

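Of course, this counts "liberty," and "this." as words (note the punctuation attached to them). It also treats "The" and "the" as different words. And reading the whole file at once can be costly for very large files.

Here is a version that ignores punctuation and case, and is more memory-efficient on large files.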

import collections 
import re 

with open('gettysburg.txt') as f: 
    c = collections.Counter(
        word.lower()
        for line in f
        for word in re.findall(r'\b[^\W\d_]+\b', line))

print "'Four' appears %d times"%c['Four'] 
print "'the' appears %d times"%c['the'] 
print "There are %d total words"%sum(c.values()) 
print "The 5 most common words are", c.most_common(5)

Result:

$ python foo.py 
'Four' appears 0 times 
'the' appears 11 times 
There are 271 total words 
The 5 most common words are [('that', 13), ('the', 11), ('we', 10), ('to', 8), ('here', 8)]
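
The pattern [^\W\d_]+ matches runs of word characters that are neither digits nor underscores, which in practice means letters only, so numbers and punctuation never reach the counter. The sketch below shows the same approach in Python 3 on a made-up sentence (the sample text and the name LETTERS_ONLY are assumptions for illustration):

```python
import collections
import re

# Runs of letters only: word characters minus digits and underscore.
LETTERS_ONLY = re.compile(r"\b[^\W\d_]+\b")

# Hypothetical sample standing in for the lines of a file.
text = "The people, the States; 87 years. The end."

c = collections.Counter(
    word.lower()
    for line in text.splitlines()
    for word in LETTERS_ONLY.findall(line)
)

print(c["the"], c["states"], c["87"])
print(c.most_common(1))
```

A Counter returns 0 for missing keys, which is why lookups such as c['Four'] in the lowercased version print 0 rather than raising an error. One side effect of this pattern is that contractions are split at the apostrophe, so "can't" is counted as "can" and "t".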


Source

2014-09-23 02:13:09

File_Name = 'file.txt'
counterDict = {}

with open(File_Name, 'r') as fh:
    for line in fh:
        # removing their punctuation
        words = line.replace('.', '').replace('\'', '').replace(',', '').lower().split()
        for word in words:
            if word not in counterDict:
                counterDict[word] = 1
            else:
                counterDict[word] = counterDict[word] + 1

print('Count of the word > common< :: ', counterDict.get('common', 0))

Source

2017-02-20 04:20:36
