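Dropping columns with sklearn's OneHotEncoder

2017-07-01

from sklearn.preprocessing import LabelEncoder as LE, OneHotEncoder as OHE
import numpy as np

a = np.array([[0,1,100],[1,2,200],[2,3,400]])

oh = OHE(categorical_features=[0,1])
a = oh.fit_transform(a).toarray()

Let's assume the first and second columns are categorical data. This code performs one-hot encoding, but for a regression problem I want to drop the first column of each encoded categorical feature. In this example there are only two such features, so I can do it by hand. But how would you solve this if you had many categorical features?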

Answers

0

You can use NumPy's indexing to slice off the first column:

>>> a
array([[ 1., 0., 0., 1., 0., 0., 100.],
       [ 0., 1., 0., 0., 1., 0., 200.],
       [ 0., 0., 1., 0., 0., 1., 400.]])
>>> a[:, 1:]
array([[ 0., 0., 1., 0., 0., 100.],
       [ 1., 0., 0., 1., 0., 200.],
       [ 0., 1., 0., 0., 1., 400.]])

If you have a list of columns to delete, here is how you would do it with fancy indexing:

>>> idx_to_delete = [0, 3]
>>> indices = [i for i in range(a.shape[-1]) if i not in idx_to_delete]
>>> indices
[1, 2, 4, 5, 6]
>>> a[:, indices]
array([[ 0., 0., 0., 0., 100.],
       [ 1., 0., 1., 0., 200.],
       [ 0., 1., 0., 1., 400.]])
+0

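Yes, this removes the first column of the first categorical set. But what if I have 1000 categories and need to drop the first column of each one after one-hot encoding? – Makaroniiii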

+0

The concept is still the same; you can extend it to a third dimension like this: 'a[:, :, 1:]' –

+0

Sorry again, but I get this error: builtins.IndexError: too many indices for array – Makaroniiii
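
For reference, the IndexError above is expected: the encoded array is two-dimensional, so 'a[:, :, 1:]' asks for one axis too many. One way to drop the first column of every categorical group from a 2-D array is sketched below. It rests on assumptions the thread does not state explicitly: the encoded columns are laid out group by group in the order of the categorical features, with one column per distinct value, as in the encoded output shown earlier. The names `group_sizes`, `first_columns`, and `keep` are made up for illustration.

```python
import numpy as np

# Original data: columns 0 and 1 are categorical, column 2 is numeric.
X = np.array([[0, 1, 100], [1, 2, 200], [2, 3, 400]])
categorical_columns = [0, 1]  # assumed to be known in advance

# One-hot encoded array as shown in the thread (categorical groups first).
encoded = np.array([[1., 0., 0., 1., 0., 0., 100.],
                    [0., 1., 0., 0., 1., 0., 200.],
                    [0., 0., 1., 0., 0., 1., 400.]])

# Width of each encoded group = number of distinct levels in that column.
group_sizes = [len(np.unique(X[:, j])) for j in categorical_columns]

# Starting offset of each group: 0, size_0, size_0 + size_1, ...
first_columns = np.cumsum([0] + group_sizes[:-1])

# Boolean mask that keeps everything except the first column of each group.
keep = np.ones(encoded.shape[1], dtype=bool)
keep[first_columns] = False
reduced = encoded[:, keep]
```

With this data, `first_columns` holds the offsets 0 and 3, and `reduced` has the same five columns as the manual `idx_to_delete = [0, 3]` example. The same bookkeeping works for any number of categorical features, as long as the layout assumption holds.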

0

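To automate this, we get the list of indices to drop by identifying the most common level of each categorical feature before applying one-hot encoding. The most common level serves best as the base level, which allows the importance of the other levels to be evaluated against it.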

After applying one-hot encoding, we build the list of indices to keep and use it to drop the previously identified columns.

from sklearn.preprocessing import OneHotEncoder as OHE
import numpy as np

a = np.array([[0,1,100],[1,2,200],[2,3,400]])

def get_indices_to_drop(X_before_OH, categorical_indices_list):
    # Returns the list of column indices to drop after one-hot encoding.
    # Drops the most common level of each categorical variable,
    # because the most common level serves best as the base level,
    # allowing the importance of the other levels to be evaluated.
    indices_to_drop = []
    indices_accum = 0
    for i in categorical_indices_list:
        # Sorted distinct levels and their counts; each level gets one encoded column
        levels, counts = np.unique(X_before_OH[:, i], return_counts=True)
        # Position of the most common level within this feature's group of columns
        most_common = int(np.argmax(counts))
        indices_to_drop.append(most_common + indices_accum)
        # Advance by the full width of this group
        indices_accum += len(levels)
    return indices_to_drop

indices_to_drop = get_indices_to_drop(a, [0, 1])

# categorical_features was removed in scikit-learn 0.22; newer versions need ColumnTransformer instead
oh = OHE(categorical_features=[0,1])
a = oh.fit_transform(a).toarray()

def get_indices_to_keep(X_after_OH, index_to_drop_list):
    return [i for i in range(X_after_OH.shape[-1]) if i not in index_to_drop_list]

indices_to_keep = get_indices_to_keep(a, indices_to_drop)
a = a[:, indices_to_keep]
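
On more recent scikit-learn releases a similar result can be had without manual index bookkeeping. The sketch below assumes scikit-learn 0.21 or later, where OneHotEncoder accepts a `drop` parameter, and uses ColumnTransformer to select the categorical columns. Note that `drop="first"` removes the first (lowest-sorted) level of each feature rather than the most common one, so this is a simpler alternative and not an exact equivalent of the code above. The variable names are illustrative.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = np.array([[0, 1, 100], [1, 2, 200], [2, 3, 400]])
categorical_columns = [0, 1]  # assumed to be known in advance

# drop="first" omits the first level of every encoded feature;
# remainder="passthrough" appends the untouched numeric column(s).
transformer = ColumnTransformer(
    [("onehot", OneHotEncoder(drop="first"), categorical_columns)],
    remainder="passthrough",
)
result = transformer.fit_transform(X)

# The output may be sparse or dense depending on its density,
# so convert only when a sparse matrix comes back.
if hasattr(result, "toarray"):
    result = result.toarray()
```

Each categorical feature with k levels contributes k - 1 columns here, so the two three-level features plus the numeric column give five columns in total.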