How do I get the most frequent words from a corpus?

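I am working with a corpus and want to get the most and least used words and word classes from it. I have the beginnings of some code, but I get errors that I don't know how to handle. I want to get the most frequent words from the Brown corpus, and then the most frequent word classes. I have this code: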

import re 
import nltk 
import string 
from collections import Counter 
from nltk.corpus import stopwords 
from collections import defaultdict, Counter 
from nltk.corpus import brown 

brown = nltk.corpus.brown 
stoplist = stopwords.words('english') 

from collections import defaultdict 

def toptenwords(brown): 
    words = brown.words() 
    no_capitals = ([word.lower() for word in words]) 
    filtered = [word for word in no_capitals if word not in stoplist] 
    translate_table = dict((ord(char), None) for char in string.punctuation) 
    no_punct = [s.translate(translate_table) for s in filtered] 
    wordcounter = defaultdict(int) 
    for word in no_punct: 
     if word in wordcounter: 
      wordcounter[word] += 1 
     else: 
      wordcounter[word] = 1 
    sorting = [(k, wordcounter[k])for k in sorted(wordcounter, key = wordcounter.get, reverse = True)] 
    return sorting 

print(toptenwords(brown)) 

words_2 = [word[0] for word in brown.tagged_words(categories="news")] 
# the most frequent words 
print Counter(words_2).most_common(10) 

words_2 = [word[1] for word in brown.tagged_words(categories="news")] 
# the most frequent word class 
print Counter(words_2).most_common(10) 


# Keeps words and pos into a dictionary 
# where the key is a word and 
# the value is a counter of POS and counts 
word_tags = defaultdict(Counter) 
for word, pos in brown.tagged_words(): 
    word_tags[word][pos] +=1 

# To access the POS counter. 
print 'Red', word_tags['Red'] 
print 'Marlowe', word_tags['Marlowe'] 
print 

# Greatest number of distinct tag. 
word_with_most_distinct_pos = sorted(word_tags, key=lambda x: len(word_tags[x]), reverse=True)[0] 

print word_with_most_distinct_pos 
print word_tags[word_with_most_distinct_pos] 
print len(word_tags[word_with_most_distinct_pos]) 

# which word has the greatest number of distinct tags 
word_tags_2 = nltk.defaultdict(lambda: set()) 
for word, token in tagged_words: 
    word_tags[word].add(token) 
    ambig_words = sorted([(k, len(v)) for (k, v) in word_tags.items()]), 
    key=itemgetter(1), reverse=True)[:50] 
    print [(word, numtoks, word_tags[word]) for (word, numtoks) in ambig_words]

When I run the code above, I get the following error:

File "Oblig2a.py", line 64 
    key=itemgetter(1), reverse=True)[:50] 
          ^
SyntaxError: invalid syntax

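From this code, I want to get:

1. The most frequent words
2. The most frequent word class
3. The least frequent word class
4. How many words have more than one word class
5. Which word has the most tags, and how many distinct tags it has
6. The last thing I need help with is writing a function that takes a specific word and reports how many times it occurs with each tag. I tried to do this above, but I can't get it to work...

It is numbers 3, 4, 5 and 6 that I need help with... Any help would be most welcome.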

Source

2017-03-03 Vebjørn Bergaplass

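Look at the stack trace. The offending line is apparently `stoplist = stopwords.words(brown)`. This method expects file IDs, not a sequence of tagged words (which is what you assigned to the variable `brown`). – lenz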

How should I change it?

You should give the function the name of the language, e.g. `stoplist = stopwords.words('english')`

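There are several issues with the code:

1. The error message tells you what is wrong: you should pass the name of a language to the stopwords function: stoplist = stopwords.words('english')
2. Use the dictionary's get method as the sort key to sort the defaultdict correctly: [(k, wordcounter[k]) for k in sorted(wordcounter, key=wordcounter.get, reverse=True)]
3. Use a translation table that works on Unicode data; see "string.translate() with unicode data in python"
4. Brown tagged words are tuples in the format (word, part-of-speech)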
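As a small illustration of the translation-table point, here is a minimal sketch assuming Python 3, where `str.translate` accepts a dict that maps code points to `None` to delete them. The helper name `strip_punctuation` and the sample tokens are made up for this example and are not part of the original code.

```python
import string

# Map every ASCII punctuation code point to None, which tells
# str.translate() to delete that character.
translate_table = {ord(char): None for char in string.punctuation}


def strip_punctuation(tokens):
    """Remove punctuation from each token and drop tokens left empty."""
    stripped = (token.translate(translate_table) for token in tokens)
    return [token for token in stripped if token]


# Hypothetical sample tokens, for illustration only.
print(strip_punctuation(["don't", "--", "u.s.", "hello"]))
```

Dropping the empty strings matters when counting: otherwise every token that consisted only of punctuation is counted as the same "word".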

Full code:

import string
from collections import Counter, defaultdict

import nltk
from nltk.corpus import stopwords

brown = nltk.corpus.brown
# A set makes the membership test in the filter below fast.
stoplist = set(stopwords.words('english'))

def toptenwords(brown):
    words = brown.words()
    # Lowercase every token, keeping duplicates so they can be counted.
    no_capitals = [word.lower() for word in words]
    filtered = [word for word in no_capitals if word not in stoplist]
    # Map each punctuation code point to None so translate() deletes it.
    translate_table = dict((ord(char), None) for char in string.punctuation)
    no_punct = [s.translate(translate_table) for s in filtered]
    wordcounter = defaultdict(int)
    for word in no_punct:
        # Tokens that were pure punctuation are now empty; skip them.
        if word:
            wordcounter[word] += 1
    # Sort the words by count, highest first.
    sorting = [(k, wordcounter[k]) for k in sorted(wordcounter, key=wordcounter.get, reverse=True)]
    return sorting[:10]


print(toptenwords(brown))

words_2 = [word[0] for word in brown.tagged_words(categories="news")]
# the most frequent words
print(Counter(words_2).most_common(10))

words_2 = [word[1] for word in brown.tagged_words(categories="news")]
# the most frequent word class
print(Counter(words_2).most_common(10))
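For the remaining points in the question (words with more than one word class, the word with the most distinct tags, and per-tag counts for one word), one possible approach is sketched below. In the question's last block, the `sorted(` call appears to be closed before the `key=` argument, which would explain the SyntaxError; here the arguments stay inside a single call. The function names and the sample pairs are hypothetical, and the functions take any iterable of (word, tag) tuples so the sketch runs without the corpus.

```python
from collections import Counter, defaultdict
from operator import itemgetter


def build_word_tags(tagged_words):
    """Map each word to a Counter of the tags it appears with."""
    word_tags = defaultdict(Counter)
    for word, tag in tagged_words:
        word_tags[word][tag] += 1
    return word_tags


def ambiguous_words(word_tags):
    """Return (word, distinct_tag_count) pairs for words with more than
    one tag, sorted so the word with the most distinct tags comes first."""
    pairs = [(word, len(tags)) for word, tags in word_tags.items() if len(tags) > 1]
    return sorted(pairs, key=itemgetter(1), reverse=True)


def tag_counts(word_tags, word):
    """Return {tag: count} for one word, or {} if the word is unknown."""
    # .get() avoids inserting an empty Counter for unseen words.
    return dict(word_tags.get(word, {}))


# Made-up sample pairs, for illustration only. With NLTK installed and the
# corpus downloaded, brown.tagged_words() could be passed in instead.
sample = [
    ("that", "CS"), ("that", "DT"), ("that", "WPS"), ("that", "CS"),
    ("run", "VB"), ("run", "NN"),
    ("the", "AT"),
]

word_tags = build_word_tags(sample)
ambiguous = ambiguous_words(word_tags)

print(len(ambiguous))                 # how many words have more than one tag
print(ambiguous[0])                   # the word with the most distinct tags
print(tag_counts(word_tags, "that"))  # per-tag counts for a single word
```

Keeping a Counter per word means one pass over the tagged words is enough to answer all three questions.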

Source

2017-03-03 10:00:27

Thanks! But how do I get the least used words and word classes from this code?

Check this thread: http://stackoverflow.com/questions/4743035/python-3-1-obtain-the-least-common-elements-array

I tried that, but I didn't get the output I wanted.
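For the least frequent items, a minimal sketch: sort the Counter's items by count in ascending order and take the first few. The helper name `least_common` and the sample tags are made up for illustration; with the corpus available, the tag list from `brown.tagged_words(categories="news")` could be counted the same way.

```python
from collections import Counter


def least_common(counter, n):
    """Return the n (item, count) pairs with the lowest counts, rarest first."""
    return sorted(counter.items(), key=lambda item: item[1])[:n]


# Hypothetical sample of tags, for illustration only.
tags = ["NN", "NN", "NN", "AT", "AT", "UH"]
tag_counter = Counter(tags)

print(least_common(tag_counter, 2))
```

In a real corpus many words typically occur exactly once, so the "least frequent words" are usually a large tie rather than a single clear answer; word classes tend to give a more meaningful result.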
