2014-03-31

Python - sklearn - ValueError: empty vocabulary

I am trying to reproduce a previously completed project, and I have run into trouble with the CountVectorizer function. Here is the code relevant to the problem.

from __future__ import division 
import nltk, textmining, pprint, re, os.path 
#import numpy as np 
from nltk.corpus import gutenberg 
import fileinput 

list = ["carmilla.txt", "pirate-caribbee.txt", "rider-sage.txt"] 

for l in list: 
    f = open(l) 
    raw1 = f.read() 
    print "<-----Here goes nothing" 
    head = raw1[:680] 
    foot = raw1[157560:176380] 
    content = raw1[680:157560] 
    print "Done---->" 

content = [re.sub(r'[\']', '', text) for text in content] 
content = [re.sub(r'[^\w\s\.]', ' ', text) for text in content] 

print content 

propernouns = [] 
for story in content: 
    propernouns = propernouns+re.findall(r'Mr.[\s][\w]+', story) 
    propernouns = propernouns+re.findall(r'Mrs.[\s][\w]+', story) 
    propernouns = propernouns+re.findall(r'Ms.[\s][\w]+', story) 
    propernouns = propernouns+re.findall(r'Miss.[\s][\w]+', story) 

propernouns = set(propernouns) 
print "\nNumber of proper nouns: " + str(len(propernouns)) 
print "\nExamples from our list of proper nouns: "+str(sorted(propernouns)) 

#Strip all of the above out of text 
for word in propernouns: 
    content = [re.sub(" "+word+" "," ",story) for story in content] 

import string 
content = [story.translate(string.maketrans("",""), "_.")] 

print "\n[2] -----Carmilla Text-----" 
print content 

#Prepare a list of stopwords 
f1 = open('stopwords.txt', 'r') 
f2 = open('stopwords2.txt', 'w') 
for line in f1: 
    f2.write(line.replace('\n', ' ')) 
    f1.close() 
    f2.close() 

stopfile = open('stopwords2.txt') 

print "Examples of stopwords: " 
print stopfile.read() 

from sklearn.feature_extraction.text import CountVectorizer 
cv = CountVectorizer(stop_words = stopfile , min_df=1) 
stories_tdm = cv.fit_transform(content).toarray() 

Execution does not complete, and I get these errors:

Traceback (most recent call last): 
    File "C:\Users\mnate_000\workspace\de.vogella.python.third\src\TestFile_EDIT.py", line 84, in <module> 
    stories_tdm = cv.fit_transform(content).toarray() 
    File "C:\Users\mnate_000\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 780, in fit_transform 
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary) 
    File "C:\Users\mnate_000\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 727, in _count_vocab 
    raise ValueError("empty vocabulary; perhaps the documents only" 
**ValueError: empty vocabulary; perhaps the documents only contain stop words** 

I don't know where it is going wrong, because as a test I tried substituting another file for "content" and confirmed the problem is not the stopfile. I can't seem to get it to run properly. Has anyone run into this problem before? Am I missing something simple?

Answer

Remember to close your files properly. f.close() is missing, and neither f2.close() nor f1.close() should be indented inside the loop.

I think this may solve your problem.

for l in list: 
    f = open(l) 
    raw1 = f.read() 
    print "<-----Here goes nothing" 
    head = raw1[:680] 
    foot = raw1[157560:176380] 
    content = raw1[680:157560] 
    print "Done---->" 
    f.close() 

...

#Prepare a list of stopwords 
f1 = open('stopwords.txt', 'r') 
f2 = open('stopwords2.txt', 'w') 
for line in f1: 
    f2.write(line.replace('\n', ' ')) 
f1.close() 
f2.close() 
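As a side note, a with block closes the files automatically even if an exception is raised mid-loop, which avoids this class of bug entirely. A minimal sketch of the same rewrite, using the file names from the question (the sample stopword content is made up for illustration):

```python
# Made-up sample input so this sketch is self-contained.
with open('stopwords.txt', 'w') as f:
    f.write('the\nand\nof\n')

# Same rewrite as above, but the "with" blocks close both files
# automatically, even if an exception is raised mid-loop.
with open('stopwords.txt', 'r') as f1, open('stopwords2.txt', 'w') as f2:
    for line in f1:
        f2.write(line.replace('\n', ' '))

with open('stopwords2.txt') as f:
    print(f.read())  # -> "the and of "
```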

EDIT: I see two problems:

One is this line: content = [story.translate(string.maketrans("", ""), "_.0123456789")]

The story variable does not exist at this indentation level, so please clarify what this is supposed to do.

The other problem is that stop_words may be a string, a list, or None. In the string case, the only supported value is 'english'. In your case, however, you are passing a file handle:

stopfile = open('stopwords2.txt') 
#... 
cv = CountVectorizer(stop_words = stopfile , min_df=1) 

What you should do instead is collect all of the text in stopfile into a list of strings. Replace this:

#Prepare a list of stopwords 
f1 = open('stopwords.txt', 'r') 
f2 = open('stopwords2.txt', 'w') 
for line in f1: 
    f2.write(line.replace('\n', ' ')) 
    f1.close() 
    f2.close() 

stopfile = open('stopwords2.txt') 

print "Examples of stopwords: " 
print stopfile.read() 

from sklearn.feature_extraction.text import CountVectorizer 
cv = CountVectorizer(stop_words = stopfile , min_df=1) 

With this:

#Prepare a list of stopwords 
f1 = open('stopwords.txt', 'r') 
stoplist = [] 
for line in f1: 
    nextlist = line.replace('\n', ' ').split() 
    stoplist.extend(nextlist) 
f1.close() 

print "Examples of stopwords: " 
print stoplist 


from sklearn.feature_extraction.text import CountVectorizer 
cv = CountVectorizer(stop_words = stoplist, min_df=1) 

I added the f.close() and adjusted the indentation; thanks for catching both. However, I am still running into the same problem. – Dillon


@Dillon: Can you tell me what content = [story.translate(string.maketrans("", ""), "_.0123456789")] is supposed to do? That is, where does the story variable come from? I don't see a story variable at that indentation level. – AndyG


@Dillon: See my edited post. – AndyG