
SciKit-Learn: very different results from cross-validation

I'm using SciKit-Learn 0.18.1 and Python 2.7 to do some basic machine learning, and I'm trying to evaluate how good my model is via cross-validation. When I do this:

from sklearn.cross_validation import cross_val_score, KFold
from sklearn.ensemble import RandomForestRegressor
import numpy as np

cv = KFold(n=5, random_state=100)

clf = RandomForestRegressor(n_estimators=400, max_features=0.5, verbose=2,
                            max_depth=30, min_samples_leaf=3)
score = cross_val_score(estimator=clf, X=X, y=y, cv=cv, n_jobs=-1,
                        scoring="neg_mean_squared_error")
avg_score = np.mean([np.sqrt(-x) for x in score])  # mean per-fold RMSE
std_dev = y.std()
print "avg_score: {}, std_dev: {}, avg_score/std_dev: {}".format(avg_score, std_dev, avg_score/std_dev)

I get a low avg_score (~9K).

What troubles me is that even though I specified 5 folds, my score array contains only 3 items. In contrast, when I do:

from sklearn.model_selection import KFold, cross_val_score 

and run the same code (except that n becomes n_splits), I get a much worse RMSE (~24K).
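For reference, that model_selection version presumably looks like this (a minimal sketch reusing the same clf, X, and y from above):

from sklearn.model_selection import KFold, cross_val_score

# In the model_selection API, KFold takes the fold count directly
cv = KFold(n_splits=5, random_state=100)
score = cross_val_score(estimator=clf, X=X, y=y, cv=cv, n_jobs=-1,
                        scoring="neg_mean_squared_error")
avg_score = np.mean([np.sqrt(-x) for x in score])  # now 5 per-fold RMSEs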

Any ideas what's going on here?

Thanks!

Answer

cv = KFold(n=5, random_state = 100) 

According to http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html#sklearn.model_selection.KFold, n is the total number of samples, and n_folds, which defaults to 3, is the number of CV folds. It looks like you ran CV with only 3 folds on 5 samples, which probably explains the difference. Maybe change n to n_folds.
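In other words, with the old cross_validation API the fix would look something like this (a sketch, assuming X here is the full feature matrix, so len(X) gives the sample count):

from sklearn.cross_validation import cross_val_score, KFold

# n = total number of samples to index; n_folds = number of CV folds
cv = KFold(n=len(X), n_folds=5, random_state=100)
score = cross_val_score(estimator=clf, X=X, y=y, cv=cv, n_jobs=-1,
                        scoring="neg_mean_squared_error")
print(len(score))  # 5 scores, one per fold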


Note that the first time around I did 'from sklearn.cross_validation import cross_val_score, KFold', so it should be 'n' – bclayman


In that case, isn't n the number of samples and n_folds the number of folds? –


Also, http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html#sklearn.model_selection.KFold makes me think sklearn.cross_validation.KFold has been deprecated –