Poor performance of RandomForestClassifier

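I wrote the following Python code to run RandomForestClassifier (with default parameter settings) on the Forest CoverType dataset from the UCI ML repository. However, the results are very poor, with accuracy around 60%, whereas this technique should be able to reach over 90% (e.g. in Weka). I have already tried increasing n_estimators to 100, but that did not bring much improvement.

Any ideas on what I could do to get better results with this technique in scikit-learn, or what might be causing this poor performance?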

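    from sklearn.datasets import fetch_covtype
    from sklearn.ensemble import RandomForestClassifier
    # sklearn.cross_validation was replaced by sklearn.model_selection in later releases
    from sklearn.model_selection import cross_val_score

    # Download (or load the cached copy of) the Forest CoverType dataset
    covtype = fetch_covtype()
    # Random forest with default parameters
    clf = RandomForestClassifier()
    # cv=3 matches the three fold scores reported below
    scores = cross_val_score(clf, covtype.data, covtype.target, cv=3)
    print(scores)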

[ 0.5483831 0.58210057 0.61055001]

2016-07-05 Bart Goethals

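You can try the following to improve your score:

- Train your model on all the attributes provided to you. It will overfit, but it will give you an idea of how much accuracy you can reach on the training set.
- Next, drop the least important features using clf.feature_importances_.
- Tune the hyperparameters of your model using grid search CV. Use cross-validation and oob_score (the out-of-bag score) to get a better estimate of generalization.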
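
A minimal sketch of the first two steps, assuming synthetic data from make_classification as a stand-in for the real features. The dataset, the 100 trees and the cut-off of 10 features are illustrative assumptions, not values from this thread:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real data (assumption): 20 features, 5 informative
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

# Fit on every feature; oob_score=True also computes an out-of-bag accuracy
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X, y)
train_acc = clf.score(X, y)   # optimistic: measured on the training rows
oob_acc = clf.oob_score_      # measured on rows each tree did not train on

# Rank features from most to least important and keep the top 10 (arbitrary cut-off)
order = np.argsort(clf.feature_importances_)[::-1]
keep = order[:10]

# Refit on the reduced feature set and compare the out-of-bag scores
clf_small = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf_small.fit(X[:, keep], y)
print(train_acc, oob_acc, clf_small.oob_score_)
```

Comparing train_acc with oob_acc gives a rough idea of how much the forest overfits, and comparing the two out-of-bag scores shows whether dropping features cost anything.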

2016-07-05 09:40:43

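Do you get 90% with the same dataset and the same estimator? The dataset documentation splits the data as follows:

- the first 11,340 records are used as the training subset
- the next 3,780 records are used as the validation subset
- the last 565,892 records are used as the test subset

and the documentation reports the following performance, which makes your untuned random forest look not so bad:

- 70% for a neural network (backpropagation)
- 58% for linear discriminant analysis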

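As for n_estimators equal to 100, you can increase it to 500, 1,000 or even more. Check the result for each value and keep the number at which the score starts to stabilize.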
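
One way to see where the score levels off is to grow the same forest in steps and record the out-of-bag score at each size. This is only a sketch on synthetic data; the step sizes and the combination of warm_start with oob_score are illustrative choices, not something taken from this thread:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real data (assumption)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

# warm_start=True keeps the trees already grown and only adds the missing ones
clf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)

oob_by_size = {}
for n in (50, 100, 200, 400):
    clf.set_params(n_estimators=n)
    clf.fit(X, y)
    oob_by_size[n] = clf.oob_score_

for n, score in oob_by_size.items():
    print(n, round(score, 4))
```

Once adding trees no longer moves the score, a larger forest mostly costs training time.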

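The problem may come from Weka's default hyperparameters being different from scikit-learn's. You can tune some of them to improve your results:

- max_features: the number of features considered for the split at each tree node.
- max_depth: perhaps the model overfits your training data a bit by growing too deep.
- min_samples_split, min_samples_leaf, min_weight_fraction_leaf and max_leaf_nodes: these control how samples are distributed among the leaves, that is, when to keep splitting and when to stop.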
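
A sketch of how these parameters could be searched together, here with RandomizedSearchCV on synthetic data. The candidate values and the number of sampled combinations are arbitrary assumptions for illustration, not recommended settings for covtype:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the real data (assumption)
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)

# Candidate values are illustrative only
param_distributions = {
    "max_features": ["sqrt", "log2", 0.5],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
}

# Sample 6 of the 36 combinations instead of trying them all
search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_distributions, n_iter=6, cv=3, random_state=0)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```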

You can also try to work on your features, either by combining them or by reducing the dimensionality.
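
For the dimensionality-reduction route, one possible sketch puts PCA in front of the forest in a pipeline. PCA, the scaler, the 10 components and the synthetic data are all assumptions made for illustration; nothing above says which reduction technique to use:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with deliberately redundant features (assumption)
X, y = make_classification(n_samples=600, n_features=30, n_informative=5,
                           n_redundant=10, random_state=0)

# Scale, project onto 10 principal components, then fit the forest
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

# The pipeline is refit inside each fold, so PCA never sees the test rows
scores = cross_val_score(pipe, X, y, cv=3)
print(scores)

pipe.fit(X, y)
print(pipe.named_steps["pca"].n_components_)
```

Whether this helps depends on the data, so it is worth comparing against the same forest on the raw features.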

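You should have a look at the Kaggle scripts, such as the one here, which describes how to get 78% with ExtraTreesClassifier (however, the training set contains the 11,340 + 3,780 records, and they seem to use a higher number of n_estimators).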

2016-07-05 10:05:44

I managed to improve on your model using GridSearchCV:

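    from sklearn.datasets import fetch_covtype
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    # sklearn.cross_validation and sklearn.grid_search were merged into sklearn.model_selection
    from sklearn.model_selection import GridSearchCV, train_test_split

    covtype = fetch_covtype()
    clf = RandomForestClassifier()

    # Hold out a third of the rows for testing (train_test_split shuffles by default)
    X_train, X_test, y_train, y_test = train_test_split(
        covtype.data, covtype.target, test_size=0.33, random_state=42)

    # Search over the number of trees and the number of features tried per split
    params = {'n_estimators': [30, 50, 100],
              'max_features': ['sqrt', 'log2', 10]}
    # 'f1_weighted' is the multiclass form; plain 'f1' only handles binary targets in current releases
    gsv = GridSearchCV(clf, params, cv=3, n_jobs=-1, scoring='f1_weighted')
    gsv.fit(X_train, y_train)

    # Report on the training set, then on the held-out test set
    print(classification_report(y_train, gsv.best_estimator_.predict(X_train)))
    print(classification_report(y_test, gsv.best_estimator_.predict(X_test)))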

The output shows a good improvement over your model:

                 precision    recall  f1-score   support

              1       1.00      1.00      1.00    141862
              2       1.00      1.00      1.00    189778
              3       1.00      1.00      1.00     24058
              4       1.00      1.00      1.00      1872
              5       1.00      1.00      1.00      6268
              6       1.00      1.00      1.00     11605
              7       1.00      1.00      1.00     13835

    avg / total       1.00      1.00      1.00    389278

                 precision    recall  f1-score   support

              1       0.97      0.95      0.96     69978
              2       0.95      0.97      0.96     93523
              3       0.95      0.96      0.95     11696
              4       0.92      0.86      0.89       875
              5       0.94      0.78      0.86      3225
              6       0.94      0.90      0.92      5762
              7       0.97      0.95      0.96      6675

    avg / total       0.96      0.96      0.96    191734

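This is not too far from the scores on the Kaggle leaderboard (note, though, that the Kaggle competition uses a much more challenging data split!).

If you want to see further improvement, you will have to consider the imbalanced classes and how best to select your training data.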
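
As a starting point for the class-imbalance side, one option is a stratified split combined with class_weight='balanced'. This is a sketch on synthetic imbalanced data, not what was used to produce the reports above; the class proportions are made up:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic three-class data with made-up proportions of 80% / 15% / 5%
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.8, 0.15, 0.05], random_state=0)

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42)

# class_weight='balanced' weights samples inversely to class frequency
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0)
clf.fit(X_train, y_train)

# Per-class precision and recall show how the rare classes are doing
report = classification_report(y_test, clf.predict(X_test))
print(report)
```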

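Note

I used fewer estimators than I normally would in order to save time. The model still performed well on the training set, so you may not need to worry about this.

I used a small number for max_features because this usually reduces the bias in model training, although that is not always the case.

I used the f1 score because I don't know the dataset well, and f1 tends to work quite well on classification problems.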

2016-07-06 13:46:21 ncfirth

I tried your code and also printed the best parameters (best_params_), which were n_estimators = 100 and max_features = 10. I then adjusted my code to use these parameters, and also added the parameter scoring='f1_weighted'. Unfortunately, I still get the same poor results. Any ideas?

    clf = RandomForestClassifier(n_estimators=100, max_features=10)
    scores = cross_validation.cross_val_score(clf, covtype.data, covtype.target, scoring='f1_weighted')
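
One difference between the two snippets in this thread that may be worth checking: train_test_split shuffles the rows by default, whereas cross_val_score with an integer (or default) cv uses StratifiedKFold without shuffling for classifiers. If the rows are stored in a meaningful order, unshuffled folds can score much lower. Whether that is what happens with covtype is an assumption, not something confirmed in this thread. The sketch below uses made-up data with a hypothetical group column purely to show the effect:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.RandomState(0)

# Made-up data: rows are stored group by group (hypothetical 'group' column)
group = np.repeat([0, 1, 2], 500)
x1 = rng.normal(size=group.size)
# The labelling rule is reversed inside group 1
y = np.where(group == 1, x1 <= 0, x1 > 0).astype(int)
X = np.column_stack([group, x1])

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Integer cv: contiguous, unshuffled folds, so each test fold is mostly one group
unshuffled = cross_val_score(clf, X, y, cv=3)

# Same number of folds, but rows are shuffled before splitting
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
shuffled = cross_val_score(clf, X, y, cv=cv)

print(unshuffled.mean(), shuffled.mean())
```

Passing a shuffled StratifiedKFold as cv to the cross_val_score call above would be a quick way to test this on the real data.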
