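How to get the most informative features for scikit-learn classifiers?

The classifiers in machine learning packages such as liblinear and NLTK offer a method show_most_informative_features(), which is really helpful for debugging features: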

viagra = None   ok : spam  =  4.5 : 1.0 
hello = True   ok : spam  =  4.5 : 1.0 
hello = None   spam : ok  =  3.3 : 1.0 
viagra = True   spam : ok  =  3.3 : 1.0 
casino = True   spam : ok  =  2.0 : 1.0 
casino = None   ok : spam  =  1.5 : 1.0

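My question is whether something similar is implemented for the classifiers in scikit-learn. I searched the documentation but couldn't find anything like it.

If there is no such function yet, does anybody know a workaround to get at those values?

Thanks a lot!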

2012-06-20 tobigue

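Do you mean the most discriminative parameters? – Simon

I'm not sure what you mean by parameters. I mean the most discriminating features, as in a bag-of-words model for spam classification: which words give the most evidence for each class. Not the parameters, which I understand as "settings" of the classifier, like the learning rate etc. – tobigue

@eowl: in machine learning parlance, *parameters* are the settings generated by the learning procedure based on the *features* of your training set. Learning rate etc. are *hyperparameters*. –

With the help of larsmans' code I came up with this code for the binary case: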

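def show_most_informative_features(vectorizer, clf, n=20):
    # On scikit-learn older than 1.0, use vectorizer.get_feature_names() instead.
    feature_names = vectorizer.get_feature_names_out()
    # Sort (coefficient, feature name) pairs from most negative to most positive.
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    # Pair the n most negative with the n most positive coefficients.
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))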

2012-06-21 14:55:49 tobigue

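Thanks, exactly what I needed! – WeaselFox

How do I call the function from a main method? What do f1 and f2 stand for? I am trying to call the function on a decision tree classifier from scikit-learn. – 2014-03-30 20:37:35

This code only works for linear classifiers that have a 'coef_' array, so unfortunately I don't think it can be used with sklearn's decision tree classifier. 'fn_1' and 'fn_2' stand for the feature names. – tobigue

Classifiers themselves do not record feature names; they just see numeric arrays. However, if you extracted your features using a Vectorizer/CountVectorizer/TfidfVectorizer/DictVectorizer, and you are using a linear model (e.g. LinearSVC or Naive Bayes), then you can apply the same trick that the document classification example uses. Example (untested, may contain a bug or two):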

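import numpy as np

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    # On scikit-learn older than 1.0, use vectorizer.get_feature_names() instead.
    feature_names = vectorizer.get_feature_names_out()
    for i, class_label in enumerate(class_labels):
        # argsort is ascending, so the last 10 indices have the largest coefficients.
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
                          " ".join(feature_names[j] for j in top10)))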

This is for multi-class classification; for the binary case, I think you should use clf.coef_[0] only. You may have to sort the class_labels.

2012-06-20 09:51:55

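Yes, in my case I only have two classes, but with your code I was able to come up with what I wanted. Thanks a lot! – tobigue

@eowl: you're welcome. Did you take the 'np.abs' of 'coef_'? Because taking the highest-valued coefficients will only return the features that are indicative of the positive class. –

Something like that... I sorted the list and took the head and the tail, which still lets you see which class each feature votes for. I posted my solution below (http://stackoverflow.com/a/11140887/979377). – tobigue

RandomForestClassifier does not yet have a coef_ attribute, but it will in the 0.17 release, I think. However, see the RandomForestClassifierWithCoef class in Recursive feature elimination on Random Forest using scikit-learn. This may give you some ideas for working around the limitation above.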
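The np.abs idea from this exchange can be sketched as follows. The coefficient values and feature names are made up and merely stand in for clf.coef_[0] and the vectorizer's feature names. Ranking by absolute value keeps strong features of both classes, and the sign tells you which class each one votes for.

```python
import numpy as np

# Hypothetical values standing in for clf.coef_[0] and the feature names.
coef = np.array([-1.2, 0.3, 2.0, -0.1, 0.9])
feature_names = np.array(["hello", "offer", "viagra", "the", "casino"])
classes = ["ok", "spam"]  # assumed to be in the same order as clf.classes_

# Indices of the 3 coefficients with the largest magnitude, strongest first.
order = np.argsort(np.abs(coef))[::-1][:3]
for j in order:
    # Positive weights favour classes[1], negative weights favour classes[0].
    voted = classes[1] if coef[j] > 0 else classes[0]
    print(f"{feature_names[j]:<10}{coef[j]:+.2f}  -> {voted}")
```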

2015-07-28 18:35:13

You can also do something like this to create a plot of the feature importances:

import matplotlib.pyplot as plt
import numpy as np

# clf is a fitted forest (e.g. RandomForestClassifier); train[features] is the training data.
importances = clf.feature_importances_
# Spread of each feature's importance across the individual trees.
std = np.std([tree.feature_importances_ for tree in clf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
# print("Feature ranking:")

# Plot the feature importances of the forest
n_features = train[features].shape[1]
plt.figure()
plt.title("Feature importances")
plt.bar(range(n_features), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(n_features), indices)
plt.xlim([-1, n_features])
plt.show()

2016-08-01 14:55:15 Oleole

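To add an update, RandomForestClassifier now supports the .feature_importances_ attribute. This attribute tells you how much of the observed variance is explained by that feature. Obviously, the sum of all these values must be <= 1.

I find this attribute very useful when doing feature engineering.

Thanks to the scikit-learn team and contributors for implementing this!

Edit: this works for both RandomForest and GradientBoosting. So RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier and GradientBoostingRegressor all support this.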
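A small sketch of reading this attribute from both ensemble types, assuming synthetic data from make_classification and placeholder feature names f0 to f5 (both invented for this example). As far as I know, scikit-learn computes these importances from impurity decrease and normalizes them, so the printed sum should come out at 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Synthetic data; with shuffle=False the informative features come first.
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # placeholder names

sums = {}
for Model in (RandomForestClassifier, GradientBoostingClassifier):
    clf = Model(random_state=0).fit(X, y)
    importances = clf.feature_importances_
    sums[Model.__name__] = importances.sum()
    # Indices of the features, most important first.
    ranking = np.argsort(importances)[::-1]
    print(Model.__name__, "sum of importances:", round(importances.sum(), 6))
    for j in ranking[:3]:
        print(f"  {feature_names[j]}: {importances[j]:.3f}")
```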

2016-08-13 07:31:42 ClimbsRocks

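We've recently released a library (https://github.com/TeamHG-Memex/eli5) which allows you to do that: it handles various classifiers from scikit-learn, binary/multi-class cases, allows highlighting text according to feature values, integrates with IPython, etc.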

2016-11-24 17:42:54
