2017-05-29

Error setting the max features parameter in sklearn's Isolation Forest algorithm

I am trying to train a dataset containing 357 features using the Isolation Forest sklearn implementation. When the max features variable is set to 1.0 (the default), I can train successfully and get results.

However, when max features is set to 2, it gives the following error:

ValueError: Number of features of the model must match the input. 
Model n_features is 2 and input n_features is 357 


It also gives the same error when the feature count is 1 (int) rather than 1.0 (float).

My understanding is that when the feature count is 2 (int), two features should be considered when creating each tree. Is that wrong? How do I change the max features parameter?

The code is as follows:

```python
from sklearn.ensemble import IsolationForest

def isolation_forest_imp(dataset):

    estimators = 10
    samples = 100
    features = 2
    contamination = 0.1
    bootstrap = False
    random_state = None
    verbosity = 0

    estimator = IsolationForest(n_estimators=estimators, max_samples=samples,
                                contamination=contamination,
                                max_features=features,
                                bootstrap=bootstrap, random_state=random_state,
                                verbose=verbosity)

    model = estimator.fit(dataset)
```

This is an issue in scikit-learn 0.18 and earlier. See the [issue](https://github.com/scikit-learn/scikit-learn/issues/5732). Update your scikit-learn to 0.20 –


Thanks @VivekKumar, that seems to be the issue. – Fleur

Answers


The documentation states:

max_features : int or float, optional (default=1.0) 
    The number of features to draw from X to train each base estimator. 

     - If int, then draw `max_features` features. 
     - If float, then draw `max_features * X.shape[1]` features. 

So, from what I understand, 2 should mean drawing two features, 1.0 should mean drawing all of the features, 0.5 half of them, and so on.
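That documented rule can be sketched as follows; `resolve_max_features` is a hypothetical helper mirroring the docs, not sklearn's actual code:

```python
import numbers

def resolve_max_features(max_features, n_total_features):
    """Resolve the documented max_features semantics to a feature count.
    Hypothetical helper, not part of scikit-learn."""
    if isinstance(max_features, numbers.Integral):
        return max_features                      # int: draw exactly this many
    return int(max_features * n_total_features)  # float: draw this fraction

print(resolve_max_features(2, 357))    # 2
print(resolve_max_features(1.0, 357))  # 357
print(resolve_max_features(0.5, 357))  # 178
```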

I think this may be a bug, because if you take a look at IsolationForest's fit:

```python
# Isolation Forest inherits from BaseBagging
# and when _fit is called, BaseBagging takes care of the features correctly
super(IsolationForest, self)._fit(X, y, max_samples,
                                  max_depth=max_depth,
                                  sample_weight=sample_weight)
# however, after _fit the decision_function is called using X - the whole
# sample - without taking max_features into account
self.threshold_ = -sp.stats.scoreatpercentile(
    -self.decision_function(X), 100. * (1. - self.contamination))
```

Then:

```python
# when the decision function's _validate_X_predict is called, with X unmodified,
# it calls the base estimator's (dt) _validate_X_predict with the whole X
X = self.estimators_[0]._validate_X_predict(X, check_input=True)

...

# from tree.py:
def _validate_X_predict(self, X, check_input):
    """Validate X whenever one tries to predict, apply, predict_proba"""
    if self.tree_ is None:
        raise NotFittedError("Estimator not fitted, "
                             "call `fit` before exploiting the model.")

    if check_input:
        X = check_array(X, dtype=DTYPE, accept_sparse="csr")
        if issparse(X) and (X.indices.dtype != np.intc or
                            X.indptr.dtype != np.intc):
            raise ValueError("No support for np.int64 index based "
                             "sparse matrices")

    # so, this check fails because X is the original X, not with max_features applied
    n_features = X.shape[1]
    if self.n_features_ != n_features:
        raise ValueError("Number of features of the model must "
                         "match the input. Model n_features is %s and "
                         "input n_features is %s "
                         % (self.n_features_, n_features))

    return X
```
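The failing shape check can be reproduced in isolation; `validate_n_features` below is an illustrative stand-in for the check, not the actual sklearn function:

```python
def validate_n_features(model_n_features, input_n_features):
    # Mimics the shape check in _validate_X_predict: raise if the
    # column counts disagree.
    if model_n_features != input_n_features:
        raise ValueError(
            "Number of features of the model must match the input. "
            "Model n_features is %s and input n_features is %s "
            % (model_n_features, input_n_features))

# Each base tree was fitted on max_features=2 columns, but
# decision_function hands it the full 357-column X:
try:
    validate_n_features(2, 357)
except ValueError as err:
    print(err)
```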

So I am not sure how you can handle this. Maybe find the fraction that yields the two features you need, though I am not sure it would work as expected.

Note: I am using scikit-learn v0.18

EDIT: As @VivekKumar said in the comments, this is a known issue; upgrading to 0.20 should do the trick.
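On a fixed scikit-learn (0.20 or later), an integer max_features works end to end. A minimal sketch on synthetic data (the 357-feature shape mirrors the question; the data itself is made up):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.randn(200, 357)  # 200 samples, 357 features, like the question

clf = IsolationForest(n_estimators=10, max_samples=100,
                      contamination=0.1, max_features=2,
                      random_state=0)
clf.fit(X)
labels = clf.predict(X)  # 1 = inlier, -1 = outlier
print(labels.shape)      # (200,)
```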