2016-08-02

I am working on a classification task, but each of the approaches below gives me a slightly different result. How do I correctly compute cross-validation scores in scikit-learn?

#First Approach 
kf = KFold(n=len(y), n_folds=10, shuffle=True, random_state=False) 
pipe= make_pipeline(SVC()) 
for train_index, test_index in kf: 
    X_train, X_test = X[train_index], X[test_index] 
    y_train, y_test = y[train_index], y[test_index] 

print ('Precision',np.mean(cross_val_score(pipe, X_train, y_train, scoring='precision'))) 



#Second Approach 
clf.fit(X_train,y_train) 
y_pred = clf.predict(X_test) 
print ('Precision:', precision_score(y_test, y_pred,average='binary')) 

#Third approach 
pipe= make_pipeline(SVC()) 
print('Precision',np.mean(cross_val_score(pipe, X, y, cv=kf, scoring='precision'))) 

#Fourth approach 

pipe= make_pipeline(SVC()) 
print('Precision',np.mean(cross_val_score(pipe, X_train, y_train, cv=kf, scoring='precision'))) 

Output:

Precision: 0.780422106837 
Precision: 0.782051282051 
Precision: 0.801544091998 

/usr/local/lib/python3.5/site-packages/sklearn/cross_validation.py in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch) 
    1431            train, test, verbose, None, 
    1432            fit_params) 
-> 1433      for train, test in cv) 
    1434  return np.array(scores)[:, 0] 
    1435 

/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable) 
    798    # was dispatched. In particular this covers the edge 
    799    # case of Parallel used with an exhausted iterator. 
--> 800    while self.dispatch_one_batch(iterator): 
    801     self._iterating = True 
    802    else: 

/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator) 
    656     return False 
    657    else: 
--> 658     self._dispatch(tasks) 
    659     return True 
    660 

/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch) 
    564 
    565   if self._pool is None: 
--> 566    job = ImmediateComputeBatch(batch) 
    567    self._jobs.append(job) 
    568    self.n_dispatched_batches += 1 

/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, batch) 
    178   # Don't delay the application, to avoid keeping the input 
    179   # arguments in memory 
--> 180   self.results = batch() 
    181 
    182  def get(self): 

/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self) 
    70 
    71  def __call__(self): 
---> 72   return [func(*args, **kwargs) for func, args, kwargs in self.items] 
    73 
    74  def __len__(self): 

/usr/local/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0) 
    70 
    71  def __call__(self): 
---> 72   return [func(*args, **kwargs) for func, args, kwargs in self.items] 
    73 
    74  def __len__(self): 

/usr/local/lib/python3.5/site-packages/sklearn/cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score) 
    1522  start_time = time.time() 
    1523 
-> 1524  X_train, y_train = _safe_split(estimator, X, y, train) 
    1525  X_test, y_test = _safe_split(estimator, X, y, test, train) 
    1526 

/usr/local/lib/python3.5/site-packages/sklearn/cross_validation.py in _safe_split(estimator, X, y, indices, train_indices) 
    1589     X_subset = X[np.ix_(indices, train_indices)] 
    1590   else: 
-> 1591    X_subset = safe_indexing(X, indices) 
    1592 
    1593  if y is not None: 

/usr/local/lib/python3.5/site-packages/sklearn/utils/__init__.py in safe_indexing(X, indices) 
    161         indices.dtype.kind == 'i'): 
    162    # This is often substantially faster than X[indices] 
--> 163    return X.take(indices, axis=0) 
    164   else: 
    165    return X[indices] 

IndexError: index 900 is out of bounds for size 900 

So, my question is: are the approaches above the correct way to compute cross validated metrics? I believe my scores are contaminated, because I am confused about exactly when to perform the cross-validation. Any ideas on how to compute the cross-validation score correctly?

UPDATE

Should the evaluation be done on the training split, like this?

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = False) 
clf = make_pipeline(SVC()) 
# However, for clf, you can use whatever estimator you like 
kf = StratifiedKFold(y = y_train, n_folds=10, shuffle=True, random_state=False) 
scores = cross_val_score(clf, X_train, y_train, cv = kf, scoring='precision') 
print('Mean score : ', np.mean(scores)) 
print('Score variance : ', np.var(scores)) 

Answers

4

For any classification task, it is always good to use StratifiedKFold for the cross-validation split. With stratified folds, each fold preserves roughly the same proportion of samples from each class as the full dataset.

StratifiedKFold
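As a rough illustration of what stratification buys you, here is a minimal pure-NumPy sketch (not sklearn's actual implementation; the `stratified_folds` helper is made up for this example) that deals each class's samples round-robin into folds, so every fold keeps the class proportions of a skewed label set:

```python
import numpy as np

def stratified_folds(y, n_folds=5, seed=0):
    """Assign each class's samples round-robin to folds, so every fold
    keeps roughly the class proportions of the full label vector."""
    rng = np.random.RandomState(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)   # all samples of this class
        rng.shuffle(idx)
        for i, sample in enumerate(idx):
            folds[i % n_folds].append(sample)
    return [np.array(f) for f in folds]

y = np.array([0] * 90 + [1] * 10)        # skewed binary labels (90% vs 10%)
for fold in stratified_folds(y, n_folds=5):
    print(np.bincount(y[fold]))          # every fold holds 18 zeros and 2 ones
```

A plain shuffled KFold could easily put 0 or 1 minority samples in some fold, which makes per-fold precision very noisy.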

It also depends on the type of classification problem, and it is always nice to look at both precision and recall scores. In the case of a skewed binary classification, people tend to use the ROC AUC score:

from sklearn import metrics 
metrics.roc_auc_score(ytest, ypred) 

Let's look at your approaches:

import numpy as np 
from sklearn.cross_validation import cross_val_score 
from sklearn.metrics import precision_score 
from sklearn.cross_validation import KFold 
from sklearn.pipeline import make_pipeline 
from sklearn.svm import SVC 

np.random.seed(1337) 

X = np.random.rand(1000,5) 

y = np.random.randint(0,2,1000) 

kf = KFold(n=len(y), n_folds=10, shuffle=True, random_state=42) 
pipe= make_pipeline(SVC(random_state=42)) 
for train_index, test_index in kf: 
    X_train, X_test = X[train_index], X[test_index] 
    y_train, y_test = y[train_index], y[test_index] 

print ('Precision',np.mean(cross_val_score(pipe, X_train, y_train, scoring='precision'))) 
# Here you are evaluating precision score on X_train. 

#Second Approach 
clf = SVC(random_state=42) 
clf.fit(X_train,y_train) 
y_pred = clf.predict(X_test) 
print ('Precision:', precision_score(y_test, y_pred, average='binary')) 

# here you are evaluating precision score on X_test 

#Third approach 
pipe= make_pipeline(SVC()) 
print('Precision',np.mean(cross_val_score(pipe, X, y, cv=kf, scoring='precision'))) 

# Here you are splitting the data again and evaluating mean on each fold 

Hence, the results are different.

+0

Thanks for your help. Regarding `cross_val_score`, do you know why I get different results, and which is the correct way to compute them? –

+1

It might be due to the random seed. Did you try setting the random_state parameter to see what happens? –

+0

Yes, I even did the following before starting anything: `np.random.seed(1337)` –

3

First of all, as explained in the documentation and shown in the examples, scikit-learn's cross_val_score does the following:

  1. Splits your dataset X into N folds (according to the cv parameter), and splits the labels y accordingly.
  2. Trains the estimator (the estimator parameter) on N-1 of those folds.
  3. Uses the estimator to predict the labels of the remaining fold.
  4. Returns a score (the scoring parameter) by comparing the predicted values with the true values.
  5. Repeats steps 2 to 4 while changing which fold is the test fold. You therefore end up with an array of N scores.
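The five steps above can be sketched in plain NumPy. `manual_cross_val_score` below is an illustrative stand-in for what cross_val_score does internally, not sklearn's actual code, and the `fit_predict` callback and `majority` baseline are invented for the example:

```python
import numpy as np

def manual_cross_val_score(fit_predict, X, y, n_folds=5):
    """Steps 1-5 above with a plain (unshuffled) K-fold split.
    fit_predict(X_train, y_train, X_test) must return predicted labels."""
    n = len(y)
    fold_sizes = np.full(n_folds, n // n_folds)
    fold_sizes[: n % n_folds] += 1                  # spread the remainder
    indices = np.arange(n)
    scores, start = [], 0
    for size in fold_sizes:
        stop = start + size
        test_idx = indices[start:stop]              # step 1: hold one fold out
        train_idx = np.concatenate([indices[:start], indices[stop:]])
        y_pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])  # steps 2-3
        scores.append(np.mean(y_pred == y[test_idx]))  # step 4: accuracy here
        start = stop                                # step 5: move to the next fold
    return np.array(scores)

# Trivial stand-in estimator: always predict the training fold's majority class.
def majority(X_tr, y_tr, X_te):
    return np.full(len(X_te), np.bincount(y_tr).argmax())

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = (X[:, 0] > 0.5).astype(int)
print(manual_cross_val_score(majority, X, y))       # an array of 5 fold scores
```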

Let's look at each of your approaches.

First approach:

Why are you splitting the data into train/test folds yourself before calling cross_val_score, when the scikit-learn function already does that for you? As a result, you are training the model on less data, and the resulting validation score is computed on that reduced subset only.

Second approach:

Here you compute a different metric on your data than cross_val_score does, so you cannot compare it with your other validation scores: they are two different things. One is a classic error percentage, while precision is a metric used to evaluate a binary (true/false) classifier. It is a good metric (you can look at ROC curves, and at precision and recall), but only compare such metrics with each other.
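To make the "two different things" concrete, here is a tiny NumPy example computing precision and the error rate on the same predictions; the labels and predictions are invented for illustration:

```python
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])   # invented ground truth
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])   # invented predictions

tp = np.sum((y_pred == 1) & (y_true == 1))    # true positives: 3
fp = np.sum((y_pred == 1) & (y_true == 0))    # false positives: 1
precision = tp / (tp + fp)                    # 3 / 4 = 0.75
error_rate = np.mean(y_pred != y_true)        # 2 of 8 wrong = 0.25
print(precision, error_rate)
```

The two numbers summarize different aspects of the same predictions, so a precision of 0.78 from one approach is not comparable with an error rate from another.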

Third approach:

This one is more natural, and this score is the good one (I mean, if you want to compare it with other classifiers/estimators). However, I would warn you against simply taking the mean of the result, because there are two things to compare: the mean and the variance. Each score in the array differs from the others, and you may want to know by how much compared to other estimators (you certainly want your variance to be as small as possible).

Fourth approach:

There seems to be an error here that is unrelated to cross_val_score itself.
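A small NumPy sketch of what goes wrong in the fourth approach (the sizes are hypothetical but mirror the question): `kf` was created with `n=len(y)` on the full set, so its index arrays run up to 999, while X_train only has 900 rows:

```python
import numpy as np

X = np.random.rand(1000, 5)         # same shape as in the question
X_train = X[:900]                   # a 90% training subset

full_set_indices = np.arange(1000)  # what a splitter built on all of X yields

try:
    X_train[full_set_indices]       # cross_val_score indexes X_train with these
except IndexError as err:
    print(err)                      # "index 900 is out of bounds ..."
```

This is exactly the IndexError shown in the traceback above.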

One final point:

Use only the second OR the third approach to compare estimators. But they definitely do not estimate the same thing: precision versus error rate.

clf = make_pipeline(SVC()) 
# However, for clf, you can use whatever estimator you like 
scores = cross_val_score(clf, X, y, cv = 10, scoring='precision') 
print('Mean score : ', np.mean(scores)) 
print('Score variance : ', np.var(scores)) 

By changing clf to another estimator (or integrating this into a loop), you will obtain a score for each estimator and be able to compare them.
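As a sketch of that mean-plus-variance comparison (the per-fold scores below are invented for illustration, not real results):

```python
import numpy as np

# Invented per-fold precision scores for two hypothetical estimators.
scores_a = np.array([0.78, 0.80, 0.79, 0.81, 0.77])
scores_b = np.array([0.70, 0.90, 0.75, 0.85, 0.72])

for name, s in (("estimator A", scores_a), ("estimator B", scores_b)):
    print(name, "mean:", s.mean().round(3), "variance:", s.var().round(5))
# Similar means, but A's much smaller variance makes its estimate more reliable.
```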

+0

Thanks for your help. Regarding the third approach, I don't understand why I can't do `cross_val_score(pipe, X_train, y_train, cv=kf, scoring='precision')`. I get the following exception: IndexError: index 900 is out of bounds for size 900. Why does this happen? –

+2

`kf` was built on the whole set, i.e. len(y), while you are passing X_train, y_train. –

+1

Yes, absolutely :) I think what you want to do is `cv=10` (instead of `cv=kf`). –