TfIdf matrix returns the wrong number of features for BernoulliNB

Using the Python library sklearn, I am trying to extract features from a training set and fit a BernoulliNB classifier with that data.

Once the classifier has been trained, I want to predict (classify) some new test data. Unfortunately, I get this error:

Traceback (most recent call last):
  File "sentiment_analysis.py", line 45, in <module>
    main()
  File "sentiment_analysis.py", line 41, in main
    prediction = classifier.predict(tfidf_data)
  File "\Python27\lib\site-packages\sklearn\naive_bayes.py", line 64, in predict
    jll = self._joint_log_likelihood(X)
  File "\Python27\lib\site-packages\sklearn\naive_bayes.py", line 724, in _joint_log_likelihood
    % (n_features, n_features_X))
ValueError: Expected input with 4773 features, got 13006 instead

Here is my code:

#Train the Classifier 
data,target = load_file('validation/validation_set_5.csv') 
tf_idf = preprocess(data) 
classifier = BernoulliNB().fit(tf_idf, target) 

#Predict test data 
count_vectorizer = CountVectorizer(binary=True)
test = count_vectorizer.fit_transform(test) 
tfidf_data = TfidfTransformer(use_idf=False).fit_transform(test) 
prediction = classifier.predict(tfidf_data)

Source

2015-10-05 fsteinbauer

This is why you get the error:

test = count_vectorizer.fit_transform(test) 
tfidf_data = TfidfTransformer(use_idf=False).fit_transform(test)

Here you should only use the old transformers (CountVectorizer and TfidfTransformer are your transformers), the ones that were fitted on the training set.

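fit_transform means that you fit these transformers on the new set, losing all information about the old fit, and then transform "test" with the refitted transformer (which has learned from the new samples and therefore has a different feature set). As a result, the test set is transformed into a new set of features that is incompatible with the features used for the training set. To fix this, call the transform method (not fit_transform) on the old transformers that were fitted on the training set.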
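To make the mismatch concrete, here is a minimal sketch. The two tiny corpora and the variable names are made up for illustration and are not from the question. It compares the number of columns produced by a vectorizer refitted on the test documents with the number produced by reusing the vectorizer fitted on the training documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora standing in for the real training and test data.
train_docs = ["good movie", "bad movie", "great plot"]
test_docs = ["terrible acting and bad plot", "good acting"]

# Fit on the training documents: the vocabulary is learned here.
train_vectorizer = CountVectorizer(binary=True)
X_train = train_vectorizer.fit_transform(train_docs)

# Refitting on the test documents learns a different vocabulary,
# so the column count no longer matches the training matrix.
refit_vectorizer = CountVectorizer(binary=True)
X_test_refit = refit_vectorizer.fit_transform(test_docs)

# Reusing the fitted vectorizer keeps the training vocabulary;
# words unseen during training are simply ignored.
X_test_reused = train_vectorizer.transform(test_docs)

print("train features:        ", X_train.shape[1])
print("test features (refit): ", X_test_refit.shape[1])
print("test features (reused):", X_test_reused.shape[1])
```

Only the reused vectorizer yields a matrix with the same number of columns as the training matrix, which is what the classifier expects.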

You should write something like:

test = old_count_vectorizer.transform(test) 
tfidf_data = old_tfidf_transformer.transform(test)
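For a fuller picture, the following sketch shows the whole flow end to end. It assumes that the question's preprocess helper applies a CountVectorizer followed by a TfidfTransformer (its body is not shown in the question), and it uses invented toy documents and labels in place of the CSV data. The key point is that the same fitted objects are kept and reused for the test data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data; the real data comes from the asker's CSV file.
train_docs = ["good movie", "bad movie", "great plot", "awful plot"]
train_labels = [1, 0, 1, 0]
test_docs = ["good plot with terrible acting", "awful movie"]

# Keep references to the transformers so they can be reused later.
count_vectorizer = CountVectorizer(binary=True)
tfidf_transformer = TfidfTransformer(use_idf=False)

# Training: fit_transform learns the vocabulary and the weighting.
train_counts = count_vectorizer.fit_transform(train_docs)
train_tfidf = tfidf_transformer.fit_transform(train_counts)
classifier = BernoulliNB().fit(train_tfidf, train_labels)

# Prediction: transform only, using the already fitted objects.
test_counts = count_vectorizer.transform(test_docs)
test_tfidf = tfidf_transformer.transform(test_counts)
prediction = classifier.predict(test_tfidf)

print(test_tfidf.shape[1] == train_tfidf.shape[1])
print(prediction)
```

If preprocess creates its transformers internally, one option is to have it return them alongside the matrix so the prediction step can call transform on the same objects.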

Source

2015-10-05 10:42:23
