2017-04-03

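I am trying to split my dataset into a training set and a test set using the train_test_split function from scikit-learn, but I get this scikit-learn error: The least populated class in y has only 1 member.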

In [1]: y.iloc[:,0].value_counts() 
Out[1]: 
M2 38 
M1 35 
M4 29 
M5 15 
M0 15 
M3 15 

In [2]: xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85, stratify=y) 
Out[2]: 
Traceback (most recent call last): 
    File "run_ok.py", line 48, in <module> 
    xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=1/3,random_state=85,stratify=y) 
    File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1700, in train_test_split 
    train, test = next(cv.split(X=arrays[0], y=stratify)) 
    File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 953, in split 
    for train, test in self._iter_indices(X, y, groups): 
    File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1259, in _iter_indices 
    raise ValueError("The least populated class in y has only 1" 
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2. 

However, every class has at least 15 samples. Why am I getting this error?

X is a pandas DataFrame representing the data points, and y is a one-column pandas DataFrame containing the target variable.

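I cannot post the original data because it is proprietary, but the setup can be reproduced by creating a random pandas DataFrame (X) with 1k rows x 500 columns and a random pandas DataFrame (y) with the same number of rows (1k) that holds the target variable (a categorical label) for each row. The y DataFrame should contain several distinct categorical labels (e.g. 'class1', 'class2', ...), each occurring at least 15 times.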
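
The description above could be turned into data roughly as follows. This is only a sketch of one possible stand-in dataset: the seed, the normal distribution, the column name "label", and the choice of 10 evenly sized classes are all assumptions for illustration, not properties of the original data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

n_rows, n_cols = 1000, 500

# Random feature matrix standing in for the proprietary data.
X = pd.DataFrame(rng.normal(size=(n_rows, n_cols)))

# Hypothetical label names; repeating them evenly gives each class 100 rows,
# which is above the 15-occurrence minimum mentioned in the question.
class_names = [f"class{i}" for i in range(1, 11)]
labels = rng.permutation(np.resize(class_names, n_rows))

# One-column DataFrame holding the target variable.
y = pd.DataFrame({"label": labels})

print(X.shape, y.shape)
print(y.iloc[:, 0].value_counts())
```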

You should post a complete, reproducible code snippet that includes the error, the full stack trace, and a sample of the data.

Answer

The problem was that train_test_split takes 2 arrays as input, but my y array was a one-column matrix. If I pass only the first column of y, it works.

# y.iloc[:, 0] selects the first (and only) column of y as a 1-D Series
xtrain, xtest, ytrain, ytest = train_test_split(X, y.iloc[:, 0], test_size=1/3,
    random_state=85, stratify=y.iloc[:, 0])
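
For anyone who wants to try the fix without the original data, here is a minimal sketch of the same call on synthetic data. The class counts are copied from the value_counts() output in the question; the random features, the 5-column width, the seed, and the column name "label" are assumptions made up for the example. It only shows the working 1-D call and does not attempt to reproduce the error.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Class counts copied from the value_counts() output in the question.
counts = {"M2": 38, "M1": 35, "M4": 29, "M5": 15, "M0": 15, "M3": 15}
labels = np.repeat(list(counts.keys()), list(counts.values()))

rng = np.random.default_rng(0)  # arbitrary seed
X = pd.DataFrame(rng.normal(size=(len(labels), 5)))  # 5 features is arbitrary
y = pd.DataFrame({"label": labels})  # one-column DataFrame, as in the question

# Select the single column by position so the target is a 1-D Series.
target = y.iloc[:, 0]

xtrain, xtest, ytrain, ytest = train_test_split(
    X, target, test_size=1/3, random_state=85, stratify=target
)

# Every class should show up in both splits, in roughly a 2:1 ratio.
print(ytrain.value_counts())
print(ytest.value_counts())
```

Keeping the selected column in one variable (target here) means the same 1-D object is used both as the labels and as the stratify argument.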