2016-10-26 103 views
3

GridSearchCV with StratifiedKFold

I want to run GridSearchCV on a RandomForestClassifier, but the data is imbalanced, so I use StratifiedKFold:

from sklearn.model_selection import StratifiedKFold 
from sklearn.grid_search import GridSearchCV 
from sklearn.ensemble import RandomForestClassifier 

param_grid = {'n_estimators':[10, 30, 100, 300], "max_depth": [3, None], 
      "max_features": [1, 5, 10], "min_samples_leaf": [1, 10, 25, 50], "criterion": ["gini", "entropy"]} 

rfc = RandomForestClassifier() 

clf = GridSearchCV(rfc, param_grid=param_grid, cv=StratifiedKFold()).fit(X_train, y_train) 

But I get an error:

TypeError         Traceback (most recent call last) 
<ipython-input-597-b08e92c33165> in <module>() 
    9 rfc = RandomForestClassifier() 
    10 
---> 11 clf = GridSearchCV(rfc, param_grid=param_grid, cv=StratifiedKFold()).fit(X_train, y_train) 

c:\python34\lib\site-packages\sklearn\grid_search.py in fit(self, X, y) 
    811 
    812   """ 
--> 813   return self._fit(X, y, ParameterGrid(self.param_grid)) 

c:\python34\lib\site-packages\sklearn\grid_search.py in _fit(self, X, y, parameter_iterable) 
    559          self.fit_params, return_parameters=True, 
    560          error_score=self.error_score) 
--> 561     for parameters in parameter_iterable 
    562     for train, test in cv) 

c:\python34\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self, iterable) 
    756    # was dispatched. In particular this covers the edge 
    757    # case of Parallel used with an exhausted iterator. 
--> 758    while self.dispatch_one_batch(iterator): 
    759     self._iterating = True 
    760    else: 

c:\python34\lib\site-packages\sklearn\externals\joblib\parallel.py in dispatch_one_batch(self, iterator) 
    601 
    602   with self._lock: 
--> 603    tasks = BatchedCalls(itertools.islice(iterator, batch_size)) 
    604    if len(tasks) == 0: 
    605     # No more tasks available in the iterator: tell caller to stop. 

c:\python34\lib\site-packages\sklearn\externals\joblib\parallel.py in __init__(self, iterator_slice) 
    125 
    126  def __init__(self, iterator_slice): 
--> 127   self.items = list(iterator_slice) 
    128   self._size = len(self.items) 

c:\python34\lib\site-packages\sklearn\grid_search.py in <genexpr>(.0) 
    560          error_score=self.error_score) 
    561     for parameters in parameter_iterable 
--> 562     for train, test in cv) 
    563 
    564   # Out is a list of triplet: score, estimator, n_test_samples 

TypeError: 'StratifiedKFold' object is not iterable 

When I write cv=StratifiedKFold(y_train) I get ValueError: The number of folds must be of Integral type. But when I write cv=5, it works.

I don't understand what is wrong with StratifiedKFold.

Answers

0

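The API changed in the latest version. You used to pass y to the constructor; now you pass only the number of folds when creating the StratifiedKFold object, and you pass y later.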
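A minimal sketch of the newer calling convention described above, assuming scikit-learn 0.18 or later. The toy arrays `X` and `y` are made up for illustration and are not from the question:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced toy data: 8 samples of class 0, 4 of class 1.
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 8 + [1] * 4)

# The constructor takes only the number of folds...
skf = StratifiedKFold(n_splits=4)

# ...and y is passed later, when the splits are generated.
folds = list(skf.split(X, y))
for train_idx, test_idx in folds:
    # Per-class counts in each test fold; stratification keeps the class ratio.
    print(np.bincount(y[test_idx]))
```

The object itself holds no data, which is why iterating over it directly, as the old `sklearn.grid_search` code tries to do, fails.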

+0

I wrote 'cv = StratifiedKFold(10)' and got 'TypeError: 'StratifiedKFold' object is not iterable'. When should I pass y? – user183897

+0

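In the current version, import sklearn.model_selection.StratifiedKFold. Then you can do cv = StratifiedKFold(10) and there should be no error. But perhaps you are importing from the old module, which is kept for compatibility until version 0.20. – simon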

+0

May I ask another question? I downloaded the file scikit_learn-0.18-cp34-cp34m-win32.whl from http://www.lfd.uci.edu/~gohlke/pythonlibs/#scikit-learn and installed it, but now I get 'ImportError: DLL load failed: %1 is not a valid Win32 application.' What is wrong? – user183897

0

It seems that cv=StratifiedKFold()).fit(X_train, y_train) should be changed to cv=StratifiedKFold()).split(X_train, y_train).

+0

This has nothing to do with the error. The line clf = GridSearchCV(rfc, param_grid=param_grid, cv=StratifiedKFold()).fit(X_train, y_train) just defines the object clf and then calls its fit method to train it. – sera

+0

@rll also mentioned that fit should be replaced by split. – ebrahimi

0

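The problem here is an API change, as mentioned in the other answers, but those answers could be more explicit.

The documentation for the cv parameter states: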

cv : int, cross-validation generator or an iterable, optional

Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 3-fold cross-validation,

  • integer, to specify the number of folds.

  • An object to be used as a cross-validation generator.

  • An iterable yielding train/test splits.

For integer/None inputs, if y is binary or multiclass, StratifiedKFold used. If the estimator is a classifier or if y is neither binary nor multiclass, KFold is used.

So, whichever cross-validation strategy is used, all that is needed is to supply the generator returned by the split function, as suggested:

kfolds = StratifiedKFold(5) 
clf = GridSearchCV(estimator, parameters, scoring=qwk, cv=kfolds.split(xtrain,ytrain)) 
clf.fit(xtrain, ytrain) 
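
The snippet above leaves `estimator`, `parameters`, `qwk`, `xtrain` and `ytrain` undefined. A self-contained sketch of the same idea might look like the following; the synthetic data, the random forest and the tiny grid are assumptions added for illustration, and the custom `scoring=qwk` argument is omitted because its definition is not shown:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Hypothetical imbalanced data standing in for xtrain / ytrain.
xtrain, ytrain = make_classification(
    n_samples=200, n_features=10, weights=[0.8, 0.2], random_state=0
)

kfolds = StratifiedKFold(5)
# Deliberately tiny grid so the sketch runs quickly.
parameters = {"n_estimators": [10, 30], "max_depth": [3, None]}

clf = GridSearchCV(
    RandomForestClassifier(random_state=0),
    parameters,
    cv=kfolds.split(xtrain, ytrain),  # iterable of (train, test) index pairs
)
clf.fit(xtrain, ytrain)
print(clf.best_params_)
```

Note that `sklearn.model_selection.GridSearchCV` also accepts the `kfolds` object itself as `cv`, so calling `split` by hand is one option rather than a requirement.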
2

I had exactly the same problem.

The solution that worked for me was to replace

from sklearn.grid_search import GridSearchCV 

with

from sklearn.model_selection import GridSearchCV 

Then it should work fine.
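
For reference, a sketch of the question's code with that import swapped, assuming scikit-learn 0.18 or later. The synthetic `X_train` / `y_train` and the reduced grid are stand-ins chosen so the example runs quickly, and `n_splits=3` is set explicitly because the default number of folds differs between releases:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Hypothetical stand-in for the question's X_train / y_train.
X_train, y_train = make_classification(
    n_samples=150, n_features=10, weights=[0.85, 0.15], random_state=0
)

# Reduced version of the question's grid, to keep the run short.
param_grid = {
    "n_estimators": [10, 30],
    "max_features": [1, 5],
    "criterion": ["gini", "entropy"],
}

rfc = RandomForestClassifier(random_state=0)
# Both classes come from sklearn.model_selection, so the splitter
# object can be passed as cv without calling split() first.
clf = GridSearchCV(rfc, param_grid=param_grid, cv=StratifiedKFold(n_splits=3)).fit(X_train, y_train)
print(clf.best_params_, clf.best_score_)
```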

0

In version '0.18.1' of sklearn,

GridSearchCV(estimator, param_grid=param_grid, cv=5)

uses a StratifiedKFold with 5 splits when the estimator is a classifier.

Documentation:

> cv : int, cross-validation generator or an iterable, optional 
>   Determines the cross-validation splitting strategy. 
>   Possible inputs for cv are: 
>   - None, to use the default 3-fold cross validation, 
>   - integer, to specify the number of folds in a `(Stratified)KFold`, 
>   - An object to be used as a cross-validation generator. 
>   - An iterable yielding train, test splits. 
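
One way to see what an integer `cv` resolves to is the public helper `sklearn.model_selection.check_cv`. That `GridSearchCV` resolves its `cv` argument the same way is my understanding rather than something stated in the quoted documentation, and the target array below is invented for the example:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])  # hypothetical binary target

# Integer cv + classifier + binary/multiclass y resolves to StratifiedKFold.
cv_clf = check_cv(5, y, classifier=True)
# Integer cv for a non-classifier resolves to plain KFold.
cv_reg = check_cv(5, y, classifier=False)

print(type(cv_clf).__name__, cv_clf.get_n_splits())
print(type(cv_reg).__name__, cv_reg.get_n_splits())
```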