2017-07-28 55 views
2

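python - Binary encoding of a column containing multiple terms

I need to binary-encode a column that contains lists of comma-separated strings.

Can you help me get from here: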

df = pd.DataFrame({'_id': [1,2,3], 
        'test': [['one', 'two', 'three'], 
          ['three', 'one'], 
          ['four', 'one']]}) 
df 

_id test 
1 [one, two, three] 
2 [three, one] 
3 [four, one] 

to:

df_result = pd.DataFrame({'_id': [1,2,3], 
          'one': [1,1,1], 
          'two': [1,0,0], 
          'three': [1,1,0], 
          'four': [0,0,1]}) 

df_result[['_id', 'one', 'two', 'three', 'four']] 

_id one two three four 
    1 1 1 1  0 
    2 1 0 1  0 
    3 1 0 0  1 

Any help would be greatly appreciated!

Answers

3

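You can use str.get_dummies: pop extracts the column, str.join converts each list to a single '|'-separated string, and a final join attaches the indicator columns to the original DataFrame: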

df = df.join(df.pop('test').str.join('|').str.get_dummies()) 
print (df) 
    _id four one three two 
0 1  0 1  1 1 
1 2  0 1  1 0 
2 3  1 1  0 0 
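
To make the one-liner above easier to follow, here is a minimal sketch that splits it into its separate steps. The intermediate names (lists, joined, dummies, result) are only illustrative and do not appear in the answer.

```python
import pandas as pd

df = pd.DataFrame({'_id': [1, 2, 3],
                   'test': [['one', 'two', 'three'],
                            ['three', 'one'],
                            ['four', 'one']]})

# Step 1: pop removes the 'test' column from df and returns it as a Series.
lists = df.pop('test')

# Step 2: str.join concatenates each list into one '|'-separated string.
joined = lists.str.join('|')

# Step 3: str.get_dummies splits on '|' (its default separator) and
# builds one 0/1 indicator column per distinct token.
dummies = joined.str.get_dummies()

# Step 4: join aligns the indicator columns with the remaining columns by index.
result = df.join(dummies)
print(result)
```

Note that this relies on '|' not occurring inside any of the tokens; if it might, a different separator can be passed to both str.join and str.get_dummies.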

Instead of pop, you can use drop:

df = df.drop('test', axis=1).join(df['test'].str.join('|').str.get_dummies()) 
print (df) 
    _id four one three two 
0 1  0 1  1 1 
1 2  0 1  1 0 
2 3  1 1  0 0 

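Solution with a new DataFrame:

# dtype=int keeps 0/1 output on pandas versions where get_dummies defaults to bool
df1 = pd.get_dummies(pd.DataFrame(df.pop('test').values.tolist()), prefix='', prefix_sep='', dtype=int) 
# merge duplicate column names; groupby(..., axis=1) is deprecated in recent pandas, so transpose instead
df = df.join(df1.T.groupby(level=0).max().T) 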
print (df) 
    _id four one three two 
0 1  0 1  1 1 
1 2  0 1  1 0 
2 3  1 1  0 0 

I also tried a solution that converts to string with astype, but some cleaning is necessary:

df1 = df.pop('test').astype(str).str.strip("'[]").str.replace(r"',\s+'", '|', regex=True).str.get_dummies() 
df = df.join(df1) 
print (df) 
    _id four one three two 
0 1  0 1  1 1 
1 2  0 1  1 0 
2 3  1 1  0 0 
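
As a rough illustration of what the cleaning in the astype variant does, the sketch below prints the intermediate strings for the same data. The names as_text, stripped and piped are hypothetical and only used here.

```python
import pandas as pd

s = pd.Series([['one', 'two', 'three'],
               ['three', 'one'],
               ['four', 'one']])

# astype(str) yields the textual form of each list, e.g. "['one', 'two', 'three']".
as_text = s.astype(str)

# strip removes any leading/trailing quote and bracket characters.
stripped = as_text.str.strip("'[]")

# The remaining "', '" separators are replaced by '|' (regex=True is needed
# because the pattern uses \s+).
piped = stripped.str.replace(r"',\s+'", '|', regex=True)

print(as_text[0])
print(stripped[0])
print(piped[0])
```

Because this depends on how Python prints lists, it is likely to break if a token itself contains quotes or brackets, so the str.join variant is probably the safer choice.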
+0

Very cool.. thanks a lot! – Codutie

1

We can use the sklearn.preprocessing.MultiLabelBinarizer method:

from sklearn.preprocessing import MultiLabelBinarizer 

mlb = MultiLabelBinarizer() 

df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('test')), 
          columns=mlb.classes_, 
          index=df.index)) 

Result:

In [15]: df 
Out[15]: 
    _id four one three two 
0 1  0 1  1 1 
1 2  0 1  1 0 
2 3  1 1  0 0
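
The outputs above list the indicator columns alphabetically, while the question's df_result uses the order one, two, three, four. A minimal sketch of restoring that order by selecting the columns explicitly, shown here with the str.get_dummies approach (the names encoded and wanted are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'_id': [1, 2, 3],
                   'test': [['one', 'two', 'three'],
                            ['three', 'one'],
                            ['four', 'one']]})

# Encode as in the first answer; columns come out sorted alphabetically.
encoded = df.join(df.pop('test').str.join('|').str.get_dummies())

# Select the columns in the order requested in the question.
wanted = ['_id', 'one', 'two', 'three', 'four']
encoded = encoded[wanted]
print(encoded)
```

The same column selection should work on the result of any of the other approaches, since they all produce the same column names.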