2013-12-23

I am trying to do document classification, as described in NLTK Chapter 6, and cannot get stopword removal to work. When I add NLTK stopword removal with

all_words = (w for w in all_words if w not in nltk.corpus.stopwords.words('english')) 

it returns

Traceback (most recent call last):
  File "fiction.py", line 8, in <module>
    word_features = all_words.keys()[:100]
AttributeError: 'generator' object has no attribute 'keys'

I am guessing that the stopword-removal line changes the type of the `all_words` object, so it no longer has a `.keys()` method. How can I remove the stopwords before using `keys()` without changing the object's type? Full code below:

import nltk 
from nltk.corpus import PlaintextCorpusReader 

corpus_root = './nltk_data/corpora/fiction' 
fiction = PlaintextCorpusReader(corpus_root, '.*') 
all_words = nltk.FreqDist(w.lower() for w in fiction.words()) 
all_words = (w for w in all_words if w not in nltk.corpus.stopwords.words('english')) 
word_features = all_words.keys()[:100] 

def document_features(document): # [_document-classify-extractor] 
    document_words = set(document) # [_document-classify-set] 
    features = {} 
    for word in word_features: 
        features['contains(%s)' % word] = (word in document_words) 
    return features 

print document_features(fiction.words('fic/11.txt')) 

Answer


I would avoid adding them to the FreqDist instance in the first place:

all_words=nltk.FreqDist(w.lower() for w in fiction.words() if w.lower() not in nltk.corpus.stopwords.words('english')) 

Depending on the size of your corpus, I think you may get a performance boost by building a set of the stopwords before doing this:

stopword_set = frozenset(nltk.corpus.stopwords.words('english')) 

If that does not work for your situation, it looks like you can take advantage of the fact that FreqDist inherits from dict:

for stopword in nltk.corpus.stopwords.words('english'): 
    if stopword in all_words: 
        del all_words[stopword] 
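Note that on Python 3 (and current NLTK), `all_words.keys()[:100]` would fail even without the stopword filter, because `keys()` returns a view that cannot be sliced. `FreqDist.most_common(n)` works on both versions; a small sketch with toy data:

```python
import nltk

# most_common(n) returns the n highest-count (word, count) pairs,
# sorted from most to least frequent.
all_words = nltk.FreqDist(['dog', 'dog', 'fox', 'cat', 'dog', 'fox'])
word_features = [w for w, _count in all_words.most_common(2)]
```

Here `word_features` contains the two most frequent words, so it can replace the `all_words.keys()[:100]` slice in the question.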

Perfect. Thanks! – user3128184