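python - Binary encoding of a column containing multiple terms

I need to do a binary transformation of a column that contains lists of comma-separated strings.

Can you help me get from here: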

import pandas as pd

df = pd.DataFrame({'_id': [1, 2, 3],
                   'test': [['one', 'two', 'three'],
                            ['three', 'one'],
                            ['four', 'one']]})
df

_id  test
  1  [one, two, three]
  2  [three, one]
  3  [four, one]

to:

df_result = pd.DataFrame({'_id': [1, 2, 3],
                          'one': [1, 1, 1],
                          'two': [1, 0, 0],
                          'three': [1, 1, 0],
                          'four': [0, 0, 1]})

df_result[['_id', 'one', 'two', 'three', 'four']]

_id  one  two  three  four
  1    1    1      1     0
  2    1    0      1     0
  3    1    0      0     1

Any help would be greatly appreciated!

Source

2017-07-28 Codutie

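You can use str.get_dummies: pop extracts the column, str.join converts each list to a string, and a final join attaches the result to the original DataFrame: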

df = df.join(df.pop('test').str.join('|').str.get_dummies()) 
print(df)
   _id  four  one  three  two
0    1     0    1      1    1
1    2     0    1      1    0
2    3     1    1      0    0
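
To see what each part of the chained call does, the one-liner can be broken into steps. This is only a sketch: the intermediate names `lists`, `joined`, `dummies` and `result` are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'_id': [1, 2, 3],
                   'test': [['one', 'two', 'three'],
                            ['three', 'one'],
                            ['four', 'one']]})

# Step 1: remove the list column from df and keep it as a Series
lists = df.pop('test')

# Step 2: join each list into one '|'-separated string, e.g. 'one|two|three'
joined = lists.str.join('|')

# Step 3: split on '|' (the default separator of str.get_dummies)
# and build one 0/1 indicator column per distinct word
dummies = joined.str.get_dummies()

# Step 4: attach the indicator columns to the remaining columns, aligned on the index
result = df.join(dummies)
print(result)
```

Note that pop modifies df in place, so after step 1 the original frame no longer contains the test column.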

Instead of pop, you can use drop:

df = df.drop('test', axis=1).join(df['test'].str.join('|').str.get_dummies())
print(df)
   _id  four  one  three  two
0    1     0    1      1    1
1    2     0    1      1    0
2    3     1    1      0    0

A solution that builds a new DataFrame:

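# dtype=int keeps the 0/1 output (newer pandas versions return booleans by default)
df1 = pd.get_dummies(pd.DataFrame(df.pop('test').values.tolist()), prefix='', prefix_sep='', dtype=int)
# collapse duplicate column names; transposing avoids the deprecated groupby(axis=1)
df = df.join(df1.T.groupby(level=0).max().T)
print(df)
   _id  four  one  three  two
0    1     0    1      1    1
1    2     0    1      1    0
2    3     1    1      0    0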
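
The grouping step is needed because get_dummies encodes each list position separately, so the same word can show up as several columns with the same name. The sketch below makes that visible; the names `wide`, `dummies` and `collapsed` are illustrative, and it uses a transpose rather than `axis=1` in groupby, on the assumption that a recent pandas version is installed.

```python
import pandas as pd

lists = pd.Series([['one', 'two', 'three'],
                   ['three', 'one'],
                   ['four', 'one']])

# One column per list position (0, 1, 2); shorter lists are padded with missing values
wide = pd.DataFrame(lists.tolist())

# Each position is encoded on its own, so 'one' and 'three' each appear twice
dummies = pd.get_dummies(wide, prefix='', prefix_sep='', dtype=int)
print(dummies.columns.tolist())

# Group the identically named columns and take the maximum:
# a word counts as present if it occurs at any position
collapsed = dummies.T.groupby(level=0).max().T
print(collapsed)
```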

I also tried a solution that converts the column to strings with astype, but some cleaning is necessary:

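# regex=True is required for the pattern to be treated as a regular expression in recent pandas
df1 = df.pop('test').astype(str).str.strip("'[]").str.replace(r"',\s+'", '|', regex=True).str.get_dummies()
df = df.join(df1)
print(df)
   _id  four  one  three  two
0    1     0    1      1    1
1    2     0    1      1    0
2    3     1    1      0    0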

Source

2017-07-28 08:59:23 jezrael

Very cool... thanks a lot! – Codutie

We can use sklearn.preprocessing.MultiLabelBinarizer:

from sklearn.preprocessing import MultiLabelBinarizer 

mlb = MultiLabelBinarizer() 

df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('test')), 
          columns=mlb.classes_, 
          index=df.index))

Result:

In [15]: df
Out[15]:
   _id  four  one  three  two
0    1     0    1      1    1
1    2     0    1      1    0
2    3     1    1      0    0
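
The outputs shown in this thread all list the indicator columns alphabetically, whereas the question asked for the order one, two, three, four. A minimal sketch of one way to restore that order, assuming the desired order is known in advance (the list `wanted` is written by hand here and is not part of any answer):

```python
import pandas as pd

df = pd.DataFrame({'_id': [1, 2, 3],
                   'test': [['one', 'two', 'three'],
                            ['three', 'one'],
                            ['four', 'one']]})

# Encode with the str.get_dummies approach; columns come back sorted alphabetically
encoded = df.join(df.pop('test').str.join('|').str.get_dummies())

# Hypothetical, hand-written column order taken from the question
wanted = ['_id', 'one', 'two', 'three', 'four']

# reindex reorders the columns; a name missing from the data would be
# created and filled with 0 instead of raising an error
encoded = encoded.reindex(columns=wanted, fill_value=0)
print(encoded)
```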

Source

2017-07-28 09:13:36 MaxU
