I'm trying to do document classification, as described in NLTK Chapter 6, and I'm having trouble removing stopwords. When I add the NLTK stopword removal line
all_words = (w for w in all_words if w not in nltk.corpus.stopwords.words('english'))
it returns
Traceback (most recent call last):
  File "fiction.py", line 8, in <module>
    word_features = all_words.keys()[:100]
AttributeError: 'generator' object has no attribute 'keys'
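My guess is that the stopword code changes the type of the `all_words` object, which makes its .keys() method unavailable. How can I remove the stopwords before calling keys() without changing the object's type? Full code below: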
import nltk
from nltk.corpus import PlaintextCorpusReader

corpus_root = './nltk_data/corpora/fiction'
fiction = PlaintextCorpusReader(corpus_root, '.*')
all_words = nltk.FreqDist(w.lower() for w in fiction.words())
all_words = (w for w in all_words if w not in nltk.corpus.stopwords.words('english'))
word_features = all_words.keys()[:100]

def document_features(document): # [_document-classify-extractor]
    document_words = set(document) # [_document-classify-set]
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

print document_features(fiction.words('fic/11.txt'))
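One possible way around the error, as a minimal sketch rather than a definitive fix: the parenthesized expression on the stopword line builds a generator, which has no .keys(). Filtering the tokens before they are counted keeps the result dict-like. The sketch below uses `collections.Counter` in place of `nltk.FreqDist` so that it runs without the corpus (in NLTK 3, `FreqDist` is a `Counter` subclass, so the same pattern should carry over), and `STOPWORDS` is a tiny hypothetical stand-in for `set(nltk.corpus.stopwords.words('english'))`. The helper name `top_word_features` and the sample tokens are also assumptions made for illustration.

```python
from collections import Counter

# Hypothetical stand-in for set(nltk.corpus.stopwords.words('english')).
# Building the set once also avoids re-reading the stopword list per token.
STOPWORDS = {"the", "a", "of", "and", "is"}

def top_word_features(words, stopwords, n=100):
    """Return the n most frequent non-stopword tokens, most frequent first."""
    # Filter while counting, so the result stays a Counter (dict-like)
    # instead of turning into a generator.
    counts = Counter(w.lower() for w in words if w.lower() not in stopwords)
    return [word for word, _ in counts.most_common(n)]

def document_features(document, word_features):
    """Map each feature word to whether it occurs in the document."""
    document_words = set(document)
    return {"contains(%s)" % w: (w in document_words) for w in word_features}

# Made-up sample tokens standing in for fiction.words().
tokens = "The cat and the dog saw a cat".split()
word_features = top_word_features(tokens, STOPWORDS, n=2)
features = document_features(["cat", "bird"], word_features)
print(features)
```

With the real corpus, `tokens` would be `fiction.words()` and `STOPWORDS` the NLTK English list. Using `most_common(n)` instead of slicing `keys()` makes the "top n by frequency" intent explicit and does not depend on the key order of the underlying dict.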
Perfect. Thanks! – user3128184