2017-07-25 77 views
12

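How do I one-hot encode a pandas column that contains lists?

I would like to break down a pandas column consisting of lists of elements into as many columns as there are unique elements, i.e. one-hot encode them (with 1 indicating that a given element is present in a row and 0 that it is absent).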

For example, take the DataFrame df

Col1  Col2  Col3
C     33    [Apple, Orange, Banana]
A     2.5   [Apple, Grape]
B     42    [Banana]

I would like to convert this to:

df

Col1  Col2  Apple  Orange  Banana  Grape
C     33    1      1       1       0
A     2.5   1      0       0       1
B     42    0      0       1       0

How can I use pandas/sklearn to achieve this?
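To try the answers on the same data, the example frame can be rebuilt as follows. This is a minimal sketch: the values are copied from the table above, and the default integer index is an assumption, since the question does not show one.

```python
import pandas as pd

# Rebuild the example frame from the question; Col3 holds Python lists.
df = pd.DataFrame({
    "Col1": ["C", "A", "B"],
    "Col2": [33, 2.5, 42],
    "Col3": [["Apple", "Orange", "Banana"], ["Apple", "Grape"], ["Banana"]],
})
print(df)
```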

Answers

15

We can also use sklearn.preprocessing.MultiLabelBinarizer:

import pandas as pd 
from sklearn.preprocessing import MultiLabelBinarizer 

mlb = MultiLabelBinarizer() 
# pop() removes Col3 from df; the encoded columns are joined back on the index. 
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('Col3')), 
                          columns=mlb.classes_, 
                          index=df.index)) 

Result:

In [77]: df 
Out[77]: 
  Col1  Col2  Apple  Banana  Grape  Orange
0    C  33.0      1       1      0       1
1    A   2.5      1       0      1       0
2    B  42.0      0       1      0       0
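Note that df.pop('Col3') modifies df in place. If the original frame should stay untouched, the same MultiLabelBinarizer idea can be wrapped in a small helper. The function name one_hot_lists is hypothetical and the sample frame is rebuilt from the question; treat this as a sketch rather than the answer author's code.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer


def one_hot_lists(frame, column):
    """Return a new frame with `column` (lists of labels) one-hot encoded.

    Hypothetical helper: the input frame is left unchanged.
    """
    mlb = MultiLabelBinarizer()
    encoded = pd.DataFrame(
        mlb.fit_transform(frame[column]),  # 0/1 matrix, one column per label
        columns=mlb.classes_,              # labels in sorted order
        index=frame.index,                 # keep row alignment with `frame`
    )
    return frame.drop(columns=column).join(encoded)


df = pd.DataFrame({
    "Col1": ["C", "A", "B"],
    "Col2": [33, 2.5, 42],
    "Col3": [["Apple", "Orange", "Banana"], ["Apple", "Grape"], ["Banana"]],
})
out = one_hot_lists(df, "Col3")
print(out)
```

Because mlb.classes_ is sorted, the label columns come out alphabetically rather than in order of first appearance.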
+1

You might find the timings interesting. – piRSquared

6

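Use get_dummies:

# Stack the labels into one long Series, encode them, then sum back to one row per index label. 
df_out = df.assign(**pd.get_dummies(df.Col3.apply(lambda x:pd.Series(x)).stack().reset_index(level=1,drop=True)).groupby(level=0).sum()) 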

Output:

  Col1  Col2                     Col3  Apple  Banana  Grape  Orange
0    C  33.0  [Apple, Orange, Banana]      1       1      0       1
1    A   2.5           [Apple, Grape]      1       0      1       0
2    B  42.0                 [Banana]      0       1      0       0

Clean up the column:

df_out.drop('Col3',axis=1) 

Output:

  Col1  Col2  Apple  Banana  Grape  Orange
0    C  33.0      1       1      0       1
1    A   2.5      1       0      1       0
2    B  42.0      0       1      0       0
+1

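+1 for using '**' with 'get_dummies', but this may be slow for large DataFrames because of '.stack()' and the method chaining. –

+0

@BradSolomon Thank you. –

+0

I'm not sure this works correctly... Try it after: 'df = pd.concat([df,df])' – Alexander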

5

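You can loop through Col3 with apply and convert each list into a Series that uses the list as its index; those index labels become the column headers of the resulting DataFrame:

pd.concat([ 
     df.drop("Col3", axis=1), 
     df.Col3.apply(lambda x: pd.Series(1, x)).fillna(0) 
    ], axis=1) 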

#  Col1  Col2  Apple  Banana  Grape  Orange
#0    C  33.0    1.0     1.0    0.0     1.0
#1    A   2.5    1.0     0.0    1.0     0.0
#2    B  42.0    0.0     1.0    0.0     0.0
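The indicator columns above come out as floats because the missing labels are NaN before fillna(0) runs. If integer 0/1 columns are preferred, one possible tweak is to cast afterwards. This is a sketch on the question's sample data, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["C", "A", "B"],
    "Col2": [33, 2.5, 42],
    "Col3": [["Apple", "Orange", "Banana"], ["Apple", "Grape"], ["Banana"]],
})

# Each list becomes a Series of ones indexed by its labels; labels missing
# from a row are NaN until fillna(0), so the cast to int has to come last.
dummies = (
    df["Col3"]
    .apply(lambda labels: pd.Series(1, index=labels))
    .fillna(0)
    .astype(int)
)
out = pd.concat([df.drop(columns="Col3"), dummies], axis=1)
print(out)
```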
5

You can get all the unique fruits in Col3 using a set comprehension as follows:

set(fruit for fruits in df.Col3 for fruit in fruits) 

Using a dictionary comprehension, you can then go through each unique fruit and check whether it is in each row's list.

>>> df[['Col1', 'Col2']].assign(**{fruit: [1 if fruit in cell else 0 for cell in df.Col3] 
            for fruit in set(fruit for fruits in df.Col3 
                for fruit in fruits)}) 
  Col1  Col2  Apple  Banana  Grape  Orange
0    C  33.0      1       1      0       1
1    A   2.5      1       0      1       0
2    B  42.0      0       1      0       0
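A set has no guaranteed iteration order, so the order of the new columns may differ between runs. Sorting the unique fruits first is one way to make it deterministic. The snippet below is a sketch of that variation on the question's sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["C", "A", "B"],
    "Col2": [33, 2.5, 42],
    "Col3": [["Apple", "Orange", "Banana"], ["Apple", "Grape"], ["Banana"]],
})

# Collect the unique fruits and sort them so the column order is stable.
fruits = sorted({fruit for cell in df["Col3"] for fruit in cell})

# One 0/1 column per fruit: 1 where the fruit occurs in that row's list.
out = df[["Col1", "Col2"]].assign(
    **{fruit: [int(fruit in cell) for cell in df["Col3"]] for fruit in fruits}
)
print(out)
```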

Timings

dfs = pd.concat([df] * 1000) # Use 3,000 rows in the dataframe. 

# Solution 1 by @Alexander (me) 
%%timeit -n 1000 
dfs[['Col1', 'Col2']].assign(**{fruit: [1 if fruit in cell else 0 for cell in dfs.Col3] 
           for fruit in set(fruit for fruits in dfs.Col3 for fruit in fruits)}) 
# 10 loops, best of 3: 4.57 ms per loop 

# Solution 2 by @Psidom 
%%timeit -n 1000 
pd.concat([ 
     dfs.drop("Col3", axis=1), 
     dfs.Col3.apply(lambda x: pd.Series(1, x)).fillna(0) 
    ], axis=1) 
# 10 loops, best of 3: 748 ms per loop 

# Solution 3 by @MaxU 
from sklearn.preprocessing import MultiLabelBinarizer 
mlb = MultiLabelBinarizer() 

%%timeit -n 10 
dfs.join(pd.DataFrame(mlb.fit_transform(dfs.Col3), 
          columns=mlb.classes_, 
          index=dfs.index)) 
# 10 loops, best of 3: 283 ms per loop 

# Solution 4 by @ScottBoston 
%%timeit -n 10 
df_out = dfs.assign(**pd.get_dummies(dfs.Col3.apply(lambda x:pd.Series(x)).stack().reset_index(level=1,drop=True)).groupby(level=0).sum()) 
# 10 loops, best of 3: 512 ms per loop 

But... 
>>> print(df_out.head()) 
  Col1  Col2                     Col3  Apple  Banana  Grape  Orange
0    C  33.0  [Apple, Orange, Banana]   1000    1000      0    1000
1    A   2.5           [Apple, Grape]   1000       0   1000       0
2    B  42.0                 [Banana]      0    1000      0       0
0    C  33.0  [Apple, Orange, Banana]   1000    1000      0    1000
1    A   2.5           [Apple, Grape]   1000       0   1000       0
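The counts of 1000 above appear to come from the repeated index labels in dfs rather than from get_dummies itself: summing per index label merges every row that shares a label. The sketch below reproduces the effect with two copies of the sample frame and shows that ignore_index=True avoids it. The helper name stack_dummies is hypothetical, and it builds the long format from a list of lists, which is assumed to be equivalent to the apply/stack chain in Solution 4.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["C", "A", "B"],
    "Col2": [33, 2.5, 42],
    "Col3": [["Apple", "Orange", "Banana"], ["Apple", "Grape"], ["Banana"]],
})


def stack_dummies(frame):
    # Long format: one entry per (index label, fruit), then sum per index label.
    wide = pd.DataFrame(frame["Col3"].tolist(), index=frame.index)
    long = wide.stack().reset_index(level=1, drop=True)
    return pd.get_dummies(long).astype(int).groupby(level=0).sum()


dup = pd.concat([df] * 2)                       # index labels: 0, 1, 2, 0, 1, 2
fixed = pd.concat([df] * 2, ignore_index=True)  # index labels: 0 .. 5

print(stack_dummies(dup))    # 3 rows; rows sharing a label were added together
print(stack_dummies(fixed))  # 6 rows of 0/1 indicators
```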
10

Option 1
Short answer
pir_slow

df.drop('Col3', axis=1).join(df.Col3.str.join('|').str.get_dummies()) 

  Col1  Col2  Apple  Banana  Grape  Orange
0    C  33.0      1       1      0       1
1    A   2.5      1       0      1       0
2    B  42.0      0       1      0       0
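One caveat with the string-join approach: a label that itself contains the separator would be split into several columns. A hedged sketch that checks for this before encoding is shown below; the SEP constant is an illustrative name, and '|' is simply the default separator of str.get_dummies.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["C", "A", "B"],
    "Col2": [33, 2.5, 42],
    "Col3": [["Apple", "Orange", "Banana"], ["Apple", "Grape"], ["Banana"]],
})

SEP = "|"  # assumed separator; it must never occur inside a label

# Fail early if any label contains the separator.
if any(SEP in label for labels in df["Col3"] for label in labels):
    raise ValueError("a label contains the separator; choose another SEP")

# Join each list into one string, then split it back into 0/1 columns.
dummies = df["Col3"].str.join(SEP).str.get_dummies(sep=SEP)
out = df.drop(columns="Col3").join(dummies)
print(out)
```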

Option 2
Fast answer
pir_fast

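import numpy as np 
import pandas as pd 

v = df.Col3.values 
l = [len(x) for x in v.tolist()]       # number of labels in each row 
f, u = pd.factorize(np.concatenate(v)) # integer code per label, unique labels 
n, m = len(v), u.size 
i = np.arange(n).repeat(l)             # row number of each flattened label 

# i * m + f is each label's position in a flattened n x m grid. 
dummies = pd.DataFrame(
    np.bincount(i * m + f, minlength=n * m).reshape(n, m), 
    df.index, u 
) 

df.drop('Col3', axis=1).join(dummies) 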

  Col1  Col2  Apple  Orange  Banana  Grape
0    C  33.0      1       1       1      0
1    A   2.5      1       0       0      1
2    B  42.0      0       0       1      0
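The index arithmetic in pir_fast is easier to follow with the intermediate arrays written out. The walk-through below uses the three lists from the question. The longer variable names are illustrative stand-ins for l, f, u and i above.

```python
import numpy as np
import pandas as pd

lists = [["Apple", "Orange", "Banana"], ["Apple", "Grape"], ["Banana"]]

lengths = [len(x) for x in lists]                     # [3, 2, 1]
codes, uniques = pd.factorize(np.concatenate(lists))
# codes   -> [0, 1, 2, 0, 3, 2]   (one integer per flattened label)
# uniques -> Apple, Orange, Banana, Grape (order of first appearance)

n, m = len(lists), len(uniques)                       # 3 rows, 4 labels
rows = np.arange(n).repeat(lengths)                   # [0, 0, 0, 1, 1, 2]

# Position of each (row, label) pair in a flattened n x m matrix.
flat = rows * m + codes                               # [0, 1, 2, 4, 7, 10]

# Count hits per flat position, then fold back into n rows of m columns.
onehot = np.bincount(flat, minlength=n * m).reshape(n, m)
print(onehot)
```

Because np.bincount counts occurrences, a label repeated within a single list would produce 2 rather than 1; clipping with np.minimum(onehot, 1) is one possible guard if strict 0/1 values are needed.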

Option 3
pir_alt1

df.drop('Col3', axis=1).join(
    pd.get_dummies(
     pd.DataFrame(df.Col3.tolist()).stack() 
    ).astype(int).groupby(level=0).sum() 
) 

  Col1  Col2  Apple  Orange  Banana  Grape
0    C  33.0      1       1       1      0
1    A   2.5      1       0       0      1
2    B  42.0      0       0       1      0
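On pandas 0.25 or later, Series.explode offers another route to the long format that Option 3 builds with stack. This is a swapped-in variant, not one of the functions timed in this thread, so no claim is made about how it compares in speed.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["C", "A", "B"],
    "Col2": [33, 2.5, 42],
    "Col3": [["Apple", "Orange", "Banana"], ["Apple", "Grape"], ["Banana"]],
})

# explode() repeats each index label once per list element.
long = df["Col3"].explode()

# Encode the labels, then sum back to one row per original index label.
dummies = pd.get_dummies(long).astype(int).groupby(level=0).sum()

out = df.drop(columns="Col3").join(dummies)
print(out)
```

As with the other sum-per-label approaches, this relies on the index labels being unique.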

Timing results
Code below

[Figure: timing results plot (image not available)]


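from timeit import timeit 

import numpy as np 
import pandas as pd 
from sklearn.preprocessing import MultiLabelBinarizer 


def maxu(df): 
    mlb = MultiLabelBinarizer() 
    d = pd.DataFrame(
        mlb.fit_transform(df.Col3.values), 
        df.index, mlb.classes_ 
    ) 
    return df.drop('Col3', axis=1).join(d) 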


def bos(df): 
    return df.drop('Col3', axis=1).assign(**pd.get_dummies(df.Col3.apply(lambda x:pd.Series(x)).stack().reset_index(level=1,drop=True)).groupby(level=0).sum()) 

def psi(df): 
    return pd.concat([ 
        df.drop("Col3", axis=1), 
        df.Col3.apply(lambda x: pd.Series(1, x)).fillna(0) 
    ], axis=1) 

def alex(df): 
    return df[['Col1', 'Col2']].assign(**{fruit: [1 if fruit in cell else 0 for cell in df.Col3] 
             for fruit in set(fruit for fruits in df.Col3 
                 for fruit in fruits)}) 

def pir_slow(df): 
    return df.drop('Col3', axis=1).join(df.Col3.str.join('|').str.get_dummies()) 

def pir_alt1(df): 
    return df.drop('Col3', axis=1).join(pd.get_dummies(pd.DataFrame(df.Col3.tolist()).stack()).astype(int).groupby(level=0).sum()) 

def pir_fast(df): 
    v = df.Col3.values 
    l = [len(x) for x in v.tolist()] 
    f, u = pd.factorize(np.concatenate(v)) 
    n, m = len(v), u.size 
    i = np.arange(n).repeat(l) 

    dummies = pd.DataFrame(
     np.bincount(i * m + f, minlength=n * m).reshape(n, m), 
     df.index, u 
    ) 

    return df.drop('Col3', axis=1).join(dummies) 

results = pd.DataFrame(
    index=(1, 3, 10, 30, 100, 300, 1000, 3000), 
    columns='maxu bos psi alex pir_slow pir_fast pir_alt1'.split() 
) 

# Time every function on frames of increasing size (i copies of df). 
for i in results.index: 
    d = pd.concat([df] * i, ignore_index=True) 
    for j in results.columns: 
        stmt = '{}(d)'.format(j) 
        setp = 'from __main__ import d, {}'.format(j) 
        results.at[i, j] = timeit(stmt, setp, number=10) 
+1

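This is awesome! PS I just used up my last vote for today ;-) – MaxU

+0

@MaxU Thank you (-: – piRSquared

+0

So fast! As for your timing chart, I assume the *x-axis* is the number of rows in the DataFrame? – Alexander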