2017-06-29

Split a DataFrame randomly (based on unique values)

I have a DataFrame df that looks like this:

| A | B | ... | 
--------------------- 
| one | ... | ... | 
| one | ... | ... | 
| one | ... | ... | 
| two | ... | ... | 
| three | ... | ... | 
| three | ... | ... | 
| four | ... | ... | 
| five | ... | ... | 
| five | ... | ... | 

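As you can see, A has 5 unique values. I want to split the DataFrame randomly. For example, I want 3 of the unique values in DataFrame df1 and the other 2 in DataFrame df2. My problem is that the values in A are repeated across rows, and I don't want the rows belonging to one unique value to be split across the two DataFrames.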

So the resulting DataFrames would look like this:

DataFrame df1 with 3 unique values:

| A | B | ... | 
--------------------- 
| one | ... | ... | 
| one | ... | ... | 
| one | ... | ... | 
| three | ... | ... | 
| three | ... | ... | 
| five | ... | ... | 
| five | ... | ... | 

DataFrame df2 with 2 unique values:

| A | B | ... | 
--------------------- 
| two | ... | ... | 
| four | ... | ... | 

Is there an easy way to achieve this? I thought about groupby, but I don't know how to split from there...

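You will have to extract the unique values of A into a list, split that list into 2 lists, and then select rows from your DataFrame based on the 2 lists.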

Answers

Setup

import pandas as pd

df = pd.DataFrame({'A': {0: 'one',
                         1: 'one',
                         2: 'one',
                         3: 'two',
                         4: 'three',
                         5: 'three',
                         6: 'four',
                         7: 'five',
                         8: 'five'},
                   'B': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8}})

Solution

#get 2 unique keys from column A for df1. You can control the split either 
# by absolute number in each group, or by a percentage. Check docs for the .sample() func. 
df1_keys = df.A.drop_duplicates().sample(2) 
df1 = df[df.A.isin(df1_keys)] 
#anything not in df1_keys will be assigned to df2 
df2 = df[~df.A.isin(df1_keys)] 

df1_keys 
Out[294]: 
7 five 
0  one 
Name: A, dtype: object 

df1 
Out[295]: 
     A B 
0 one 0 
1 one 1 
2 one 2 
7 five 7 
8 five 8 

df2 
Out[296]: 
     A B 
3 two 3 
4 three 4 
5 three 5 
6 four 6 
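
The comment in the solution mentions that .sample() can also take a fraction instead of an absolute count. A minimal sketch of that variant follows, assuming the same df as in the setup (rebuilt here in a shorter form); the 0.6 fraction and the random_state value are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Same data as the setup above, rebuilt so this snippet runs on its own.
df = pd.DataFrame({
    "A": ["one", "one", "one", "two", "three", "three", "four", "five", "five"],
    "B": range(9),
})

# Draw 60% of the 5 unique keys (3 keys) for df1.
# random_state is an illustrative seed that makes the draw repeatable.
df1_keys = df["A"].drop_duplicates().sample(frac=0.6, random_state=0)

df1 = df[df["A"].isin(df1_keys)]   # all rows whose key was drawn
df2 = df[~df["A"].isin(df1_keys)]  # all remaining rows
```

Because the sampling happens on the de-duplicated keys rather than on the rows, every row of a given key ends up in the same DataFrame.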

import numpy as np

v = df['A'].unique()           # Get the unique values
np.random.shuffle(v)           # Shuffle them in place
v1, v2 = np.array_split(v, 2)  # Split the unique values into two arrays

Finally, index your DataFrame using the .isin() method to get the desired result.

r1 = df[df['A'].isin(v1)] 
r2 = df[df['A'].isin(v2)]
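
np.array_split(v, 2) always cuts the keys roughly in half (3 and 2 for the five keys here). If the number of keys per DataFrame should be chosen explicitly, the same steps can be wrapped in a small helper. This is only a sketch: split_by_unique, n_first and seed are hypothetical names introduced for illustration, and the data is the example from the question.

```python
import numpy as np
import pandas as pd


def split_by_unique(df, column, n_first, seed=None):
    """Split df in two so that each unique value of `column` lands in exactly one part.

    Hypothetical helper: n_first is the number of unique values that go
    into the first part; seed makes the random choice repeatable.
    """
    rng = np.random.default_rng(seed)
    keys = df[column].drop_duplicates().to_numpy()
    shuffled = rng.permutation(keys)      # shuffled copy of the unique keys
    mask = df[column].isin(shuffled[:n_first])
    return df[mask], df[~mask]


df = pd.DataFrame({
    "A": ["one", "one", "one", "two", "three", "three", "four", "five", "five"],
    "B": range(9),
})

# 3 unique values in df1, the remaining 2 in df2, as asked in the question.
df1, df2 = split_by_unique(df, "A", n_first=3, seed=42)
```

Which keys land in df1 depends on the seed, but rows sharing a key are never separated.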