Why does cross-validation fail for RandomForestRegressor in scikit-learn?
The training and test input files have the following format:
-1 1 11.10115101|u 11.10115101 |s 2 |reason k:0.116|pv pv1000|g 2230444827 |k k3|w k:0
-1 1 11.10115101|u 11.10115101 |s 0 |reason c:0.080|pv pv1000|g 2235873129 |k k0|w c:1
-1 1 11.10115101|u 11.10115101 |s 1 |reason h:0.054 o:0.073|pv pv1000|g 2236879382 |k k10|w h:1 o:21
-1 1 11.10115101|u 11.10115101 |s 0 |reason u:0.133|pv pv1000|g 2237638819 |k k5|w u:26
-1 1 11.10115101|u 11.10115101 |s 0 |reason o:0.086|pv pv1000|g 2237694729 |k k5|w o:11
-1 1 11.10115101|u 11.10115101 |s 2 |reason l:0.111|pv pv1000|g 2237821631 |k k3|w l:0
The code is as follows. The load_data() function loads the training or test data into a list of Python dicts and returns a tuple ([dict, ...], [0, 1, 0, ...]):
parser = argparse.ArgumentParser()
parser.add_argument('-t', '--train', required = True, help='train file')
parser.add_argument('-e', '--test', required = True, help='test file')
ns = parser.parse_args(sys.argv[1:])
f = open(ns.train)
inputs, targets = load_data(f)
print >>sys.stderr, 'load finish'
vec = DictVectorizer()
train = vec.fit_transform(inputs)
print >>sys.stderr, 'dict vectorizer finish'
print >>sys.stderr, 'training'
clf = RandomForestRegressor()
clf.fit(train.toarray(), targets)
print >>sys.stderr, 'testing'
f = open(ns.test)
test_inputs, test_targets = load_data(f)
test = vec.transform(test_inputs)
print cross_validation.cross_val_score(clf, test.toarray(), test_targets, scoring='roc_auc')
Training works fine, but when I run cross-validation, the last line of the code throws an exception:
File "randomforest.py", line 72, in <module>
print cross_validation.cross_val_score(clf, test.toarray(), test_targets, scoring='roc_auc')
File "/Users/jerry/pkgs/vpy/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1152, in cross_val_score
for train, test in cv)
File "/Users/jerry/pkgs/vpy/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 517, in __call__
self.dispatch(function, args, kwargs)
File "/Users/jerry/pkgs/vpy/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 312, in dispatch
job = ImmediateApply(func, args, kwargs)
File "/Users/jerry/pkgs/vpy/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 136, in __init__
self.results = func(*args, **kwargs)
File "/Users/jerry/pkgs/vpy/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1058, in _cross_val_score
y_train = y[train]
TypeError: only integer arrays with one element can be converted to an index
I wrote the code by following the example in the manual, but it fails.
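One possible cause that is consistent with the last frame of the traceback (y_train = y[train]) is that the targets are a plain Python list, which cannot be indexed with a NumPy index array. The question does not confirm the type of test_targets, so the following is only a minimal sketch under that assumption. The names targets_list and fold_indices are hypothetical stand-ins, and the exact wording of the TypeError differs between NumPy versions.

```python
import numpy as np

# Hypothetical stand-in for test_targets, assumed here to be a plain Python list.
targets_list = [0, 1, 0, 1, 1, 0]
# Hypothetical index array, similar to what a cross-validation splitter yields for one fold.
fold_indices = np.array([0, 2, 4])

try:
    # A Python list only accepts a single integer or a slice as an index,
    # so indexing it with a multi-element NumPy array raises TypeError.
    targets_list[fold_indices]
    list_indexing_failed = False
except TypeError as exc:
    list_indexing_failed = True
    print("list indexing failed:", exc)

# Converting the list to a NumPy array makes index-array selection work.
targets_array = np.asarray(targets_list)
selected = targets_array[fold_indices]
print(selected)
```

If that turns out to be the situation, passing np.asarray(test_targets) instead of the list would be one way to avoid the error.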
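Please always report the full traceback. Also, what is 'test_targets'? What are its type and shape? Does it have the same number of samples as the 'test_inputs' variable? Apparently it is not valid. Finally, cross-validation is meant to be run on a development set for model selection. It usually does not make sense to run it on the final evaluation (test) set. – ogrisel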
Sorry, I have added more code. – mike
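You still do not provide any information about the nature of the 'test_targets' variable: is it a numpy array, a Python list, or something else? If it is an array, what are its '.shape' and '.dtype'? – ogrisel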
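The workflow ogrisel describes, cross-validating on a development set and touching the test set only once at the end, might look roughly like the sketch below. It rests on several assumptions that are not in the question: it uses the sklearn.model_selection module from newer scikit-learn releases instead of the sklearn.cross_validation module seen in the traceback, it runs on synthetic data, and it uses the regressor's default R^2 score rather than 'roc_auc'.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data (an assumption; the real features come from DictVectorizer).
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Hold out a final test set; everything else is the development set.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=20, random_state=0)

# Cross-validation runs on the development set only (3 folds here).
cv_scores = cross_val_score(model, X_dev, y_dev, cv=3)
print("cross-validation scores:", cv_scores)

# After model selection, fit on the whole development set
# and evaluate once on the held-out test set.
model.fit(X_dev, y_dev)
test_score = model.score(X_test, y_test)
print("held-out test score:", test_score)
```

The cross-validation scores guide choices such as hyperparameters, while the single test score is kept as an estimate of performance on unseen data.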