Count (and write) word frequency for each line in a text file

First time posting here - I've always been able to find earlier questions that solved my problem! My main problem is the logic... even a pseudocode answer would be great.

I'm using Python to read data from each line of a text file, in the format:

This is a tweet captured from the twitter api #hashtag http://url.com/site

Using NLTK, I can tokenize by line and then iterate over the lines with reader.sents(), etc.:

reader = TaggedCorpusReader(filecorpus, r'.*\.txt', sent_tokenizer=LineTokenizer())

reader.sents()[:10]

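But I would like to count the frequency of certain "hot words" (stored in an array or similar) per line, and then write the counts back to a text file. If I used reader.words(), I could count the frequency of the hot words across the whole text, but I'm looking for the count per line (or per "sentence", in this case).

Ideally, something like this: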

hotwords = ('tweet', 'twitter')

for each line
    tokenize into words
    for each word in line
        if word is equal to hotwords[0], hotword 0 count ++
        if word is equal to hotwords[1], hotword 1 count ++
    at end of line, for each hotwords[index]
        filewrite count,

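Also, I'm not too worried about the URL getting broken up (using WordPunctTokenizer would split it apart at the punctuation; that's not an issue).

Any useful pointers (including pseudocode or links to other similar code) would be great.

----EDIT------------------

Ended up doing something like this: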

import nltk 
from nltk.corpus.reader import TaggedCorpusReader 
from nltk.tokenize import LineTokenizer 
#from nltk.tokenize import WordPunctTokenizer 
from collections import defaultdict 

# Create reader and generate corpus from all files in dir that match the pattern.
filecorpus = 'Twitter/FINAL_RESULTS/tweetcorpus' 
filereader = TaggedCorpusReader(filecorpus, r'.*\.csv', sent_tokenizer=LineTokenizer()) 
print "Reader accessible." 
print filereader.fileids() 

#define hotwords 
hotwords = ('cool','foo','bar') 

tweetdict = [] 

for line in filereader.sents():
    # One fresh counter per line (i.e. per tweet)
    wordcounts = defaultdict(int)
    for word in line:
        if word in hotwords:
            wordcounts[word] += 1
    tweetdict.append(wordcounts)

The output is:

print tweetdict 

[defaultdict(<type 'int'>, {}),
defaultdict(<type 'int'>, {'foo': 2, 'bar': 1, 'cool': 2}), 
defaultdict(<type 'int'>, {'cool': 1})]
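The question also asks for the counts to be written back to a text file, which the edit above stops short of. A minimal sketch of that last step, assuming Python 3 and a CSV layout with one row per line and one column per hot word; the function names, the output file name `hotword_counts.csv` and the sample tokens are made up for illustration and are not from the original post:

```python
import csv
from collections import defaultdict


def count_hotwords(tokenized_lines, hotwords):
    """Return one {hotword: count} mapping per tokenized line."""
    wanted = set(hotwords)
    results = []
    for tokens in tokenized_lines:
        counts = defaultdict(int)
        for token in tokens:
            if token in wanted:
                counts[token] += 1
        results.append(counts)
    return results


def write_counts(path, per_line_counts, hotwords):
    """Write a header row, then one row of counts per line (0 if absent)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(hotwords)
        for counts in per_line_counts:
            writer.writerow([counts.get(word, 0) for word in hotwords])


hotwords = ("cool", "foo", "bar")
# Stand-in for filereader.sents(): each line is already a list of tokens.
lines = [["no", "match"], ["cool", "foo", "foo", "bar", "cool"], ["cool"]]

per_line = count_hotwords(lines, hotwords)
write_counts("hotword_counts.csv", per_line, hotwords)
```

Using `counts.get(word, 0)` rather than `counts[word]` when writing avoids adding zero entries to the defaultdict as a side effect.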

2011-04-08 bhalsall

defaultdict is your friend for this kind of thing.

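from collections import defaultdict

hotwords = ('tweet', 'twitter')

# myfile is an already-open file object
for line in myfile:
    # tokenize; a plain whitespace split here, swap in an NLTK tokenizer if needed
    words = line.split()
    word_counts = defaultdict(int)
    for word in words:
        if word in hotwords:
            word_counts[word] += 1
    print '\n'.join('%s: %s' % (k, v) for k, v in word_counts.items())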

2011-04-08 13:36:48

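Yep - just tweaked this slightly, but the logic is great - preferred this over the Counter solution. Is creating a defaultdict for every line of the text file the most efficient way? – bhalsall 2011-04-08 15:35:58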

@bhalsall: you could call 'word_counts.clear()' after each line instead of creating a new defaultdict every time. – jfs 2011-04-09 10:13:07
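One thing to watch with the clear() suggestion: it is only safe if each line's counts are used (printed, written, or copied) before the next line is processed. A small sketch of the difference, assuming Python 3 and made-up sample lines:

```python
from collections import defaultdict

hotwords = {"tweet", "twitter"}
lines = ["a tweet about twitter", "no match here", "tweet tweet"]

word_counts = defaultdict(int)

# Safe: take a copy of the counts before the dict is reused.
copied = []
for line in lines:
    word_counts.clear()
    for word in line.split():
        if word in hotwords:
            word_counts[word] += 1
    copied.append(dict(word_counts))

# Pitfall: every list entry is the same dict object, so after the loop
# they all show the counts of the last line only.
shared = []
for line in lines:
    word_counts.clear()
    for word in line.split():
        if word in hotwords:
            word_counts[word] += 1
    shared.append(word_counts)

print(copied)
print(shared)
```

So if the per-line results are collected in a list, as in the question's edit, a fresh defaultdict per line (or an explicit copy) is the simpler choice.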

Do you need to tokenize it? You can use count() on each line for each word.

hotwords = {'tweet':[], 'twitter':[]}
for line in file_obj:
    for word in hotwords.keys():
        hotwords[word].append(line.count(word))

2011-04-08 13:25:29 nmichaels

You'd end up counting substrings that way. If the hotword == 'sex', I don't want Middlesex to be counted – 2011-04-08 13:27:50

@Steve: Ah, right. – nmichaels 2011-04-08 13:30:20
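To make the substring problem concrete, here is a short sketch (Python 3, with a made-up sample line) comparing str.count() on the raw line with counting whole tokens and with a word-boundary regex:

```python
import re

line = "Middlesex is a county, sex is a word"
hotword = "sex"

# str.count() matches anywhere, including inside "Middlesex".
substring_hits = line.count(hotword)

# Counting in the list of whitespace-separated tokens matches whole tokens only.
token_hits = line.split().count(hotword)

# A word-boundary regex also matches whole words, and still works when
# punctuation is attached to the word.
regex_hits = len(re.findall(r"\b%s\b" % re.escape(hotword), line))

print(substring_hits, token_hits, regex_hits)  # prints: 2 1 1
```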

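That's the right sort of thing, though. Ideally I need to tokenize each line into words. I can't just tokenize the text into words from the start, because then I wouldn't keep the newline delimiters (which are what separate each tweet)... I'd end up counting word frequency for the whole text file rather than per line. – bhalsall 2011-04-08 13:33:45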

from collections import Counter 

hotwords = ('tweet', 'twitter') 

lines = "a b c tweet d e f\ng h i j k twitter\n\na" 

c = Counter(lines.split()) 

for hotword in hotwords: 
    print hotword, c[hotword]

This script works with Python 2.7+ (Counter was added to collections in 2.7).
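Note that the script above counts over the whole string at once. For the per-line counts the question asks about, the same Counter idea can be applied line by line; a sketch assuming Python 3 and the same sample data:

```python
from collections import Counter

hotwords = ("tweet", "twitter")
lines = "a b c tweet d e f\ng h i j k twitter\n\na"

per_line = []
for line in lines.splitlines():
    counts = Counter(line.split())
    # A Counter returns 0 for missing keys, so absent hot words show up as 0.
    per_line.append({hotword: counts[hotword] for hotword in hotwords})

for number, counts in enumerate(per_line, start=1):
    print(number, counts)
```

This builds a Counter over every word in the line and then picks out the hot words, which is a little more work than only counting hot words but keeps the code short.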

2011-04-08 13:37:23 razpeitia

You can also use 'most_common', e.g. 'c.most_common(10)', to get the 10 most common words in the counter. – razpeitia 2011-04-08 13:48:19
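For reference, a tiny sketch of most_common on made-up data (Python 3); it returns a list of (word, count) pairs, highest count first:

```python
from collections import Counter

c = Counter("a b a c a b".split())
top_two = c.most_common(2)
print(top_two)  # prints: [('a', 3), ('b', 2)]
```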

I was going to suggest using a dictionary like {String word: int count}, as @Daniel Roseman did, but this looks slicker. – Tom 2011-04-08 14:52:14
