2017-05-08 69 views
2

How to get a non-shuffled train_test_split in sklearn

If I want a random train/test split, I use the sklearn helper function:

In [1]: from sklearn.model_selection import train_test_split 
    ...: train_test_split([1,2,3,4,5,6]) 
    ...: 
Out[1]: [[1, 6, 4, 2], [5, 3]] 

What is the most concise way to get a non-shuffled train/test split, i.e.

[[1,2,3,4], [5,6]] 

Edit: currently I am using

train, test = data[:int(len(data) * 0.75)], data[int(len(data) * 0.75):] 

but I was hoping for something nicer. I have opened an issue on sklearn: https://github.com/scikit-learn/scikit-learn/issues/8844

Edit 2: My PR has been merged. As of scikit-learn version 0.19, you can pass shuffle=False to train_test_split to get a non-shuffled split.

Answers

3

Use numpy.split:

import numpy as np 
data = np.array([1,2,3,4,5,6]) 

np.split(data, [4])   # modify the index here to specify where to split the array 
# [array([1, 2, 3, 4]), array([5, 6])] 

If you want to split by percentage, you can compute the split index from the shape of the data:

data = np.array([1,2,3,4,5,6]) 
p = 0.6 

idx = int(p * data.shape[0]) + 1  # p * n may be fractional, so truncate and add 1;
             # adjust this rounding as you need, it usually shouldn't
             # matter much if the data is large
np.split(data, [idx]) 
# [array([1, 2, 3, 4]), array([5, 6])] 
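
To make the effect of the `+ 1` concrete, here is a small pure-Python sketch comparing the two rounding choices that come up in this thread: `int(p * n) + 1` as above, and plain `int(p * n)` as in the slicing one-liner from the question. The helper name `split_index` and its `add_one` flag are made up for illustration and are not part of numpy or sklearn.

```python
def split_index(n, p, add_one=False):
    """Return the cut position for an ordered split of n items at fraction p.

    Hypothetical helper: add_one=True mirrors int(p * n) + 1,
    add_one=False mirrors plain int(p * n).
    """
    idx = int(p * n)  # int() truncates the fractional part
    return idx + 1 if add_one else idx


data = [1, 2, 3, 4, 5, 6]
for p in (0.5, 0.6, 0.75):
    plain = split_index(len(data), p)
    plus = split_index(len(data), p, add_one=True)
    # Left pair: plain truncation. Right pair: truncation plus one.
    print(p, data[:plain], data[plain:], "|", data[:plus], data[plus:])
```

On six elements with p = 0.6, plain truncation cuts at index 3 (a 3/3 split) while the `+ 1` version cuts at index 4 (a 4/2 split), so it is worth checking which behaviour you want on small datasets.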
+0

Thanks, this looks almost like what I want, but what if I don't know the index I want to split at? I.e. say I just want a 60/40 split? – maxymoo

+0

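Hmm, yes, I was hoping to avoid something like that, but maybe that isn't possible in this case. Do you think it might be clearer to just do `data[:int(len(data) * p)], data[int(len(data) * p):]`? – maxymoo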

+0

Yes. That definitely works. – Psidom

4

I'm not adding anything to Psidom's answer except an easy copy-paste function:

import numpy as np

def non_shuffling_train_test_split(X, y, test_size=0.2): 
    i = int((1 - test_size) * X.shape[0]) + 1 
    X_train, X_test = np.split(X, [i]) 
    y_train, y_test = np.split(y, [i]) 
    return X_train, X_test, y_train, y_test 

Update: At some point this functionality became built in, so now you can do:

from sklearn.model_selection import train_test_split 
train_test_split(X, y, test_size=0.2, shuffle=False) 
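
For a runnable check of the built-in option, here is a minimal sketch on made-up toy arrays. It assumes a scikit-learn version that supports `shuffle=False` (0.19 or later, per Edit 2 in the question); `X` and `y` are illustrative data, not anything from the original post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 10 ordered samples, 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# shuffle=False keeps the original order: the first rows go to the
# training set and the last rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print(y_train)
print(y_test)
```

With 10 samples and test_size=0.2 this should leave the first 8 samples for training and the last 2 for testing.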
1

All you need to do is set the shuffle parameter to False and the stratify parameter to None:

In [49]: train_test_split([1,2,3,4,5,6], shuffle=False, stratify=None) 
Out[49]: [[1, 2, 3, 4], [5, 6]] 
+1

Hey mayank, actually `stratify=None` is the default (see "Edit 2" in the original question) – maxymoo