Python: splitting data into random sets

I want to split my data into two random sets. I have done the first part:

ind = np.random.choice(df.shape[0], size=[int(df.shape[0]*0.7)], replace=False) 
X_train = df.iloc[ind]

Now I want to select all the indices that are not in ind to create my test set. Could you tell me how to do that?

I thought it would be

X_test = df.iloc[-ind]

but apparently it isn't.

Source

2017-05-29 jlt199

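So you want to select 70% as training data and the remaining 30% as test data? A simpler approach might be to use np.random.shuffle to shuffle the indices, then use the first 70% of the shuffled indices for training and the rest for testing.

Yes, that is exactly what I want - jlt199

Try this pure Python approach.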

ind_inversed = list(set(range(df.shape[0])) - set(ind)) 
X_test = df.iloc[ind_inversed]
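As a side note on why `df.iloc[-ind]` fails: negating a NumPy integer array just flips the sign of each position, so `iloc` counts those rows from the end instead of excluding them. The same complement as the set difference above can also be taken with `np.setdiff1d`, which returns the sorted positions that are missing from `ind`. This is a minimal sketch; the small `df` here is a hypothetical stand-in for the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's DataFrame (10 rows)
df = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) * 2})

n = df.shape[0]
# Same idea as in the question: pick 70% of the row positions at random
ind = np.random.choice(n, size=int(n * 0.7), replace=False)

# Positions NOT in ind, returned sorted and without duplicates
test_ind = np.setdiff1d(np.arange(n), ind)

X_train = df.iloc[ind]
X_test = df.iloc[test_ind]
```

Because `ind` is already a random sample, the remaining rows are a random sample too; they simply come back in their original order.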

Source

2017-05-29 15:48:07

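This doesn't randomize the two sets

That's because I assumed 'ind' is computed the same way as in the original question. 'ind_inversed' holds all the other indices that are not in 'ind'.

You're right, sorry!

Check out scikit-learn's train_test_split()

Example from the docs: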

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]

>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]

In your case, you could do it like this:

larger, smaller = train_test_split(df, test_size=0.3)

Source
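To make the one-liner above concrete, here is a small self-contained sketch. It assumes a hypothetical 10-row `df`; `random_state=0` is only there to make the split repeatable and is not part of the original answer. `train_test_split` accepts a pandas DataFrame directly and hands back DataFrames, so the original row labels stay attached.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the asker's DataFrame
df = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) * 2})

# Roughly 70% of the rows go to X_train, the rest to X_test
X_train, X_test = train_test_split(df, test_size=0.3, random_state=0)

# Both pieces are still DataFrames, and every row lands in exactly one of them
print(type(X_train).__name__, len(X_train), len(X_test))
```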

2017-05-29 15:49:16

Another way to get a 70-30 train-test split is to generate the indices, shuffle them randomly, and then split them 70-30.

ind = np.arange(df.shape[0]) 
np.random.shuffle(ind) 
X_train = df.iloc[ind[:int(0.7*df.shape[0])],:] 
X_test = df.iloc[ind[int(0.7*df.shape[0]):],:]

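I would suggest converting the pandas.DataFrame to a numeric matrix and using scikit-learn's train_test_split to do the split, unless you really want to do it this way.

Source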
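If staying inside pandas is preferred, a possible variation on the shuffle-and-slice idea is to let `DataFrame.sample` draw the 70% and then drop those rows to get the rest. This is only a sketch under two assumptions: the `df` below is hypothetical, and the DataFrame's index labels are unique (otherwise `drop` would remove more rows than intended). The `random_state` value is arbitrary and only makes the result repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's DataFrame; the default RangeIndex is unique
df = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) * 2})

# Draw 70% of the rows at random, without replacement
X_train = df.sample(frac=0.7, random_state=0)

# Everything that was not drawn becomes the test set (relies on unique index labels)
X_test = df.drop(X_train.index)
```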

2017-05-29 15:54:38

I like this approach. Thanks. I have used 'train_test_split' before (although I had forgotten about it), but I find data easier to inspect and visualize in a DataFrame. - jlt199
