2014-03-31

Python - sklearn - ValueError: empty vocabulary

I am trying to reproduce a previously completed project, and I have run into trouble with the CountVectorizer function. Here is the code relevant to the problem.

from __future__ import division 
import nltk, textmining, pprint, re, os.path 
#import numpy as np 
from nltk.corpus import gutenberg 
import fileinput 

list = ["carmilla.txt", "pirate-caribbee.txt", "rider-sage.txt"] 

for l in list: 
    f = open(l) 
    raw1 = f.read() 
    print "<-----Here goes nothing" 
    head = raw1[:680] 
    foot = raw1[157560:176380] 
    content = raw1[680:157560] 
    print "Done---->" 

content = [re.sub(r'[\']', '', text) for text in content] 
content = [re.sub(r'[^\w\s\.]', ' ', text) for text in content] 

print content 

propernouns = [] 
for story in content: 
    propernouns = propernouns+re.findall(r'Mr.[\s][\w]+', story) 
    propernouns = propernouns+re.findall(r'Mrs.[\s][\w]+', story) 
    propernouns = propernouns+re.findall(r'Ms.[\s][\w]+', story) 
    propernouns = propernouns+re.findall(r'Miss.[\s][\w]+', story) 

propernouns = set(propernouns) 
print "\nNumber of proper nouns: " + str(len(propernouns)) 
print "\nExamples from our list of proper nouns: "+str(sorted(propernouns)) 

#Strip all of the above out of text 
for word in propernouns: 
    content = [re.sub(" "+word+" "," ",story) for story in content] 

import string 
content = [story.translate(string.maketrans("",""), "_.")] 

print "\n[2] -----Carmilla Text-----" 
print content 

#Prepare a list of stopwords 
f1 = open('stopwords.txt', 'r') 
f2 = open('stopwords2.txt', 'w') 
for line in f1: 
    f2.write(line.replace('\n', ' ')) 
    f1.close() 
    f2.close() 

stopfile = open('stopwords2.txt') 

print "Examples of stopwords: " 
print stopfile.read() 

from sklearn.feature_extraction.text import CountVectorizer 
cv = CountVectorizer(stop_words = stopfile , min_df=1) 
stories_tdm = cv.fit_transform(content).toarray() 

Execution does not complete, and I get these errors:

Traceback (most recent call last): 
    File "C:\Users\mnate_000\workspace\de.vogella.python.third\src\TestFile_EDIT.py", line 84, in <module> 
    stories_tdm = cv.fit_transform(content).toarray() 
    File "C:\Users\mnate_000\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 780, in fit_transform 
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary) 
    File "C:\Users\mnate_000\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 727, in _count_vocab 
    raise ValueError("empty vocabulary; perhaps the documents only" 
**ValueError: empty vocabulary; perhaps the documents only contain stop words** 

I don't know where it is going wrong, because as a test I tried substituting another file for "content" and confirmed the problem is not the stopfile. I can't seem to get it to run properly. Has anyone run into this problem before? Am I missing something simple?

Answer

Remember to close your files properly. f.close() is missing, and neither f2.close() nor f1.close() should be indented inside the loop.

I think this may solve your problem.

for l in list: 
    f = open(l) 
    raw1 = f.read() 
    print "<-----Here goes nothing" 
    head = raw1[:680] 
    foot = raw1[157560:176380] 
    content = raw1[680:157560] 
    print "Done---->" 
    f.close() 

...

#Prepare a list of stopwords 
f1 = open('stopwords.txt', 'r') 
f2 = open('stopwords2.txt', 'w') 
for line in f1: 
    f2.write(line.replace('\n', ' ')) 
f1.close() 
f2.close() 
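As a side note, a with block closes the files automatically even if an exception is raised mid-loop, which avoids this class of bug entirely. A minimal sketch of the same rewrite, using the file names from the question (the sample stopword content is made up for illustration):

```python
# Made-up sample input so this sketch is self-contained.
with open('stopwords.txt', 'w') as f:
    f.write('the\nand\nof\n')

# Same rewrite as above, but the "with" blocks close both files
# automatically, even if an exception is raised mid-loop.
with open('stopwords.txt', 'r') as f1, open('stopwords2.txt', 'w') as f2:
    for line in f1:
        f2.write(line.replace('\n', ' '))

with open('stopwords2.txt') as f:
    print(f.read())  # -> "the and of "
```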

EDIT: I see two problems:

One is this line: content = [story.translate(string.maketrans("", ""), "_.0123456789")]

The story variable does not exist at this indentation level, so please clarify what this is supposed to do.

The other problem is that stop_words may be a string, a list, or None. In the string case, the only supported value is 'english'. In your case, however, you are passing a file handle:

stopfile = open('stopwords2.txt') 
#... 
cv = CountVectorizer(stop_words = stopfile , min_df=1) 

What you should do instead is collect all of the text in stopfile into a list of strings. Replace this:

#Prepare a list of stopwords 
f1 = open('stopwords.txt', 'r') 
f2 = open('stopwords2.txt', 'w') 
for line in f1: 
    f2.write(line.replace('\n', ' ')) 
    f1.close() 
    f2.close() 

stopfile = open('stopwords2.txt') 

print "Examples of stopwords: " 
print stopfile.read() 

from sklearn.feature_extraction.text import CountVectorizer 
cv = CountVectorizer(stop_words = stopfile , min_df=1) 

With this:

#Prepare a list of stopwords 
f1 = open('stopwords.txt', 'r') 
stoplist = [] 
for line in f1: 
    nextlist = line.replace('\n', ' ').split() 
    stoplist.extend(nextlist) 
f1.close() 

print "Examples of stopwords: " 
print stoplist 


from sklearn.feature_extraction.text import CountVectorizer 
cv = CountVectorizer(stop_words = stoplist, min_df=1) 

I added the f.close() and adjusted the indentation; thanks for catching both. However, I am still running into the same problem. – Dillon


@Dillon: Can you tell me what content = [story.translate(string.maketrans("", ""), "_.0123456789")] is supposed to do? That is, where does the story variable come from? I don't see a story variable at that indentation level. – AndyG


@Dillon: See my edited post. – AndyG