2016-03-04

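scikit-learn: get the certainty of the classification / the score for the chosen category

I am doing some multiclass text classification, and it works well for my needs: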

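import numpy
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# my_tokenizer and stopWords are defined elsewhere
classifier = Pipeline([
    ('vect', CountVectorizer(tokenizer=my_tokenizer, stop_words=stopWords, ngram_range=(1, 2), min_df=2)),
    ('tfidf', TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)),
    ('clf', MultinomialNB(alpha=0.01, fit_prior=True))])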

categories = [list of my possible categories] 

# Learning 

news = [list of news already categorized] 
news_cat = [the category of the corresponding news] 

# Map each category name to its index; numpy.searchsorted assumes categories is sorted
news_target_cat = numpy.searchsorted(categories, news_cat)

classifier = classifier.fit(news, news_target_cat) 

# Categorizing 

news = [list of news not yet categorized] 

predicted = classifier.predict(news) 

for i, pred_cat in enumerate(predicted): 
    print(news[i]) 
    print(categories[pred_cat]) 

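Now I would like to get, along with the predicted category, the predictor's "certainty" about it (e.g. 0.0 -> "I rolled a dice to pick a category", up to 1.0 -> "nothing will change my mind about the category of that news"). How should I get that certainty value / the predictor's score for the category?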

Answer

If you need something like a probability for each category, you have to use the classifier's predict_proba() method.

Docs
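As a rough illustration of the suggestion above, here is a minimal sketch of reading a per-document score from predict_proba(). The toy training texts, the labels "sport" and "finance", and the variable names are made up for the example and are not from the question; the pipeline also uses default vectorizer settings instead of the custom tokenizer and stop words. It assumes string labels are passed to fit() directly, so the columns of the result follow classifier.classes_.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical, for illustration only).
train_news = [
    "the team won the football match",
    "the striker scored a late goal",
    "the central bank raised interest rates",
    "stocks fell as the market closed",
]
train_cat = ["sport", "sport", "finance", "finance"]

classifier = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB(alpha=0.01)),
])
classifier.fit(train_news, train_cat)

new_news = ["the goal decided the match", "interest rates and the market"]

# One row per document, one column per class; columns follow classifier.classes_.
proba = classifier.predict_proba(new_news)

# Index of the most probable class for each document.
best = np.argmax(proba, axis=1)
predicted = classifier.classes_[best]

# Probability assigned to the chosen class, used here as the "certainty".
certainty = proba[np.arange(len(new_news)), best]

for text, cat, p in zip(new_news, predicted, certainty):
    print(text, "->", cat, round(float(p), 3))
```

Note that with k categories a "dice roll" corresponds to a value near 1/k rather than 0.0, and that Naive Bayes probabilities are often pushed toward 0 or 1, so the value is probably better read as a relative score than as a calibrated probability.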

Thank you very much! I did not see it in the documentation :-( – Cabu