Creating a train/test split based on two groups with pandas and scikit-learn

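I have a pandas DataFrame, comb. The number of rows where ENROLLED_Response is True is quite small relative to the whole DataFrame, so plain random sampling may lose too much of the enrolled data.

The solution is to take a 75% sample of all rows where ENROLLED_Response == True, and then a 70% sample of all rows where ENROLLED_Response == False.

So I should end up with a column is_train on the DataFrame holding True/False.

I would normally use something like: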

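import numpy as np
# sklearn.cross_validation was replaced by sklearn.model_selection in later scikit-learn releases
from sklearn.model_selection import cross_val_score

# Split the dataset into train and test
comb['is_train'] = np.random.uniform(0, 1, len(comb)) <= .75
train, test = comb[comb['is_train']==True], comb[comb['is_train']==False]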

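This works in most cases, but because so few rows are enrolled, this approach tends to leave out too many of the 'enrolled' rows. So what I need is something more like: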

comb['is_train'] = train_test_split(comb['ENROLLED_Response']==True, Train_size = 0.75) 
comb['is_train']= train_test_split(comb['ENROLLED_Response']==False, Train_size = 0.75)

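Which of course doesn't work. The concept is: first sample the enrolled rows and randomly mark 75% of them as train, then sample the not-enrolled rows (everything else) and mark 75% of them as train, all in the same new column (is_train), so that it can be used easily with scikit-learn, as in: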

train, test = comb[comb['is_train']==True],comb[comb['is_train']==False]

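I can't figure out how to do this, because the random NumPy array is generated relative to the length of the whole DataFrame (among other problems...)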
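One possible way around the length mismatch, sketched here under assumptions rather than taken from the thread: keep a single uniform draw per row, but compare it against a per-row threshold that depends on the class. The toy DataFrame, the seed and the variable name threshold are illustrative only. Note that this gives the requested fractions only on average, so with very few enrolled rows the realised split can drift noticeably from 75%.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in for `comb` (assumption): roughly 10% of rows are enrolled.
comb = pd.DataFrame({"ENROLLED_Response": rng.random(1000) < 0.1})

# Per-row threshold: 0.75 for enrolled rows, 0.70 for everything else.
threshold = np.where(comb["ENROLLED_Response"], 0.75, 0.70)

# One uniform draw per row, so the array length always matches the DataFrame.
comb["is_train"] = rng.random(len(comb)) <= threshold

train, test = comb[comb["is_train"]], comb[~comb["is_train"]]
```

Because only two arrays of length len(comb) are created, this should stay cheap even for a few hundred thousand rows.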

Source

2015-11-19 dartdog

Update after the comments:

import pandas as pd 
import numpy as np 

np.random.seed(42) 

truePct = 0.75 
falsePct = 0.70 

comb = pd.DataFrame({ 
    "feat1": np.random.randint(low=1, high=4, size=20), 
    "feat2": np.random.randint(low=1, high=4, size=20), 
    "ENROLLED_Response": np.random.randint(low=0, high=4, size=20)==3 
}) 

# Set train to False by default 
comb['train'] = False 

# Create one random permutation per class and keep only the leading entries.
# Note: the "-1" keeps one row fewer than floor(pct * n) for each class.
nTrue = comb[comb.ENROLLED_Response==True].shape[0]
nFalse = comb[comb.ENROLLED_Response==False].shape[0]
truePerm = np.random.permutation(nTrue)[:int(np.floor(truePct*nTrue)-1)]
falsePerm = np.random.permutation(nFalse)[:int(np.floor(falsePct*nFalse)-1)]

# Select the indices 
trainTrueIndex = comb[comb.ENROLLED_Response==True].index[truePerm].values.tolist() 
trainFalseIndex = comb[comb.ENROLLED_Response==False].index[falsePerm].values.tolist() 

comb.loc[trainTrueIndex,'train'] = True 
comb.loc[trainFalseIndex,'train'] = True 

print(comb)

Result

    ENROLLED_Response  feat1  feat2  train
0               False      3      1  False
1               False      1      1  False
2               False      3      2  False
3               False      3      2  False
4               False      1      1   True
5               False      1      1   True
6               False      3      1   True
7                True      2      3   True
8               False      3      3  False
9                True      3      3   True
10              False      3      2   True
11              False      3      3   True
12              False      1      2   True
13              False      3      2   True
14              False      2      3  False
15              False      1      2   True
16              False      2      3   True
17               True      2      3  False
18               True      2      1  False
19              False      2      3   True
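The same idea can be wrapped in a small helper, sketched below as an illustration rather than as the answer's own code. The function name mark_train, its parameters and the toy data are hypothetical. Unlike the code above, it marks exactly floor(frac * n) rows per class, without the extra "-1", and it only works on index labels, so it never copies the feature columns.

```python
import numpy as np
import pandas as pd

def mark_train(df, label_col, frac_by_class, seed=0):
    """Return a boolean Series marking a random per-class sample as train."""
    rng = np.random.default_rng(seed)
    is_train = pd.Series(False, index=df.index)
    for value, frac in frac_by_class.items():
        # Index labels of the rows belonging to this class.
        idx = df.index[df[label_col] == value].to_numpy()
        n_train = int(np.floor(frac * len(idx)))
        # Draw n_train distinct labels from this class only.
        chosen = rng.choice(idx, size=n_train, replace=False)
        is_train.loc[chosen] = True
    return is_train

# Toy data (assumption): 4 enrolled rows and 16 not-enrolled rows.
comb = pd.DataFrame({"ENROLLED_Response": [True] * 4 + [False] * 16})
comb["is_train"] = mark_train(comb, "ENROLLED_Response", {True: 0.75, False: 0.70})

# Fraction of rows marked as train within each class.
print(comb.groupby("ENROLLED_Response")["is_train"].mean())
```

The groupby line at the end is a quick check that each class received roughly the fraction that was asked for.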

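I'm not entirely sure I've interpreted your question correctly, but it seems you are dealing with class imbalance in your ENROLLED_Response variable. To preserve that class imbalance in both the train and test sets, you may want to use a different scikit-learn cross-validation function: StratifiedShuffleSplit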

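This function is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.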
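A minimal sketch of how that might look, assuming a recent scikit-learn where StratifiedShuffleSplit lives in sklearn.model_selection (older releases exposed it from sklearn.cross_validation with a different signature). The toy DataFrame and seed are illustrative. Note that this applies one train fraction to both classes, so it does not reproduce the 75%/70% split from the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)

# Toy stand-in for `comb` (assumption): 200 rows, 10% of them enrolled.
comb = pd.DataFrame({
    "feat1": rng.integers(1, 4, size=200),
    "ENROLLED_Response": np.arange(200) % 10 == 0,
})

# A single stratified shuffle: 75% train, the remaining 25% test.
splitter = StratifiedShuffleSplit(n_splits=1, train_size=0.75, random_state=0)
train_pos, test_pos = next(splitter.split(comb, comb["ENROLLED_Response"]))

# split() yields positional indices, so write the flag with iloc.
comb["is_train"] = False
comb.iloc[train_pos, comb.columns.get_loc("is_train")] = True

train, test = comb[comb["is_train"]], comb[~comb["is_train"]]
```

The splitter only returns index arrays, which keeps the result compatible with the is_train column pattern used in the question.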

Source

2015-11-20 18:12:03 luckylwk

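I think your interpretation is correct, but I think my memory constraints keep me from using this method. I have about 350,000 rows... so I keep running into memory problems when calling things like "ShuffleSplit", which is why I'm trying to work out how to handle this more directly. – dartdog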

Maybe the updated answer helps... – luckylwk

Looks good. I'll have to wait for some labeling to finish, and I still need to test it, but I'm away for a week (yay!). Thanks – dartdog
