2013-12-23

I am trying to do document classification, as described in NLTK Chapter 6, and cannot get stopword removal to work. When I add NLTK stopword removal with

all_words = (w for w in all_words if w not in nltk.corpus.stopwords.words('english')) 

it returns

Traceback (most recent call last):
  File "fiction.py", line 8, in <module>
    word_features = all_words.keys()[:100]
AttributeError: 'generator' object has no attribute 'keys'

I am guessing that the stopword-removal line changes the type of the `all_words` object, so it no longer has a `.keys()` method. How can I remove the stopwords before using `keys()` without changing the object's type? Full code below:

import nltk 
from nltk.corpus import PlaintextCorpusReader 

corpus_root = './nltk_data/corpora/fiction' 
fiction = PlaintextCorpusReader(corpus_root, '.*') 
all_words = nltk.FreqDist(w.lower() for w in fiction.words()) 
all_words = (w for w in all_words if w not in nltk.corpus.stopwords.words('english')) 
word_features = all_words.keys()[:100] 

def document_features(document): # [_document-classify-extractor] 
    document_words = set(document) # [_document-classify-set] 
    features = {} 
    for word in word_features: 
        features['contains(%s)' % word] = (word in document_words) 
    return features 

print document_features(fiction.words('fic/11.txt')) 

Answer


I would avoid adding them to the FreqDist instance in the first place:

all_words=nltk.FreqDist(w.lower() for w in fiction.words() if w.lower() not in nltk.corpus.stopwords.words('english')) 

Depending on the size of your corpus, I think you may get a performance boost by building a set of the stopwords before doing this:

stopword_set = frozenset(nltk.corpus.stopwords.words('english')) 

If that does not work for your situation, it looks like you can take advantage of the fact that FreqDist inherits from dict:

for stopword in nltk.corpus.stopwords.words('english'): 
    if stopword in all_words: 
        del all_words[stopword] 
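Note that on Python 3 (and current NLTK), `all_words.keys()[:100]` would fail even without the stopword filter, because `keys()` returns a view that cannot be sliced. `FreqDist.most_common(n)` works on both versions; a small sketch with toy data:

```python
import nltk

# most_common(n) returns the n highest-count (word, count) pairs,
# sorted from most to least frequent.
all_words = nltk.FreqDist(['dog', 'dog', 'fox', 'cat', 'dog', 'fox'])
word_features = [w for w, _count in all_words.most_common(2)]
```

Here `word_features` contains the two most frequent words, so it can replace the `all_words.keys()[:100]` slice in the question.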

Perfect. Thanks! – user3128184