Dropping columns with sklearn's OneHotEncoder

from sklearn.preprocessing import LabelEncoder as LE, OneHotEncoder as OHE 
import numpy as np 

a = np.array([[0,1,100],[1,2,200],[2,3,400]]) 


# Note: categorical_features was removed in scikit-learn 0.22; this call needs an older release.
oh = OHE(categorical_features=[0,1])
a = oh.fit_transform(a).toarray()

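Let's assume the first and second columns are categorical data. This code performs one-hot encoding, but for a regression problem I would like to drop the first column of each encoded categorical feature. In this example there are only two, so I could do it manually. But how would you solve this if you had many categorical features?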

Source

2017-07-01 Makaroniiii

You can use NumPy indexing to slice off the first column:

>>> a 
array([[ 1., 0., 0., 1., 0., 0., 100.], 
     [ 0., 1., 0., 0., 1., 0., 200.], 
     [ 0., 0., 1., 0., 0., 1., 400.]]) 
>>> a[:, 1:] 
array([[ 0., 0., 1., 0., 0., 100.], 
     [ 1., 0., 0., 1., 0., 200.], 
     [ 0., 1., 0., 0., 1., 400.]])

If you have a list of columns you want to delete, here is how you would do it with fancy indexing:

>>> idx_to_delete = [0, 3] 
>>> indices = [i for i in range(a.shape[-1]) if i not in idx_to_delete] 
>>> indices 
[1, 2, 4, 5, 6] 
>>> a[:, indices] 
array([[ 0., 0., 0., 0., 100.], 
     [ 1., 0., 1., 0., 200.], 
     [ 0., 1., 0., 1., 400.]])
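
As a possible alternative to building the list of indices to keep by hand, `np.delete` removes the listed columns directly. This is only a minimal sketch that reuses the encoded array and the `idx_to_delete` list from the example above; the name `reduced` is introduced here for illustration.

```python
import numpy as np

# One-hot encoded array from the example above:
# two 3-level groups followed by one numeric column.
a = np.array([[1., 0., 0., 1., 0., 0., 100.],
              [0., 1., 0., 0., 1., 0., 200.],
              [0., 0., 1., 0., 0., 1., 400.]])

idx_to_delete = [0, 3]

# np.delete returns a new array with the given columns (axis=1) removed;
# the original array is left unchanged.
reduced = np.delete(a, idx_to_delete, axis=1)
print(reduced)
```

Both approaches give the same result; `np.delete` just saves the list comprehension.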

Source

2017-07-01 19:14:15

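Yes, this removes the first column of the first categorical set. But what if I have 1000 categories and need to drop the first column of each one after one-hot encoding? – Makaroniiii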

The concept is still the same; you can extend it to a third dimension like this: 'a[:, :, 1:]' –

Sorry again, but I get this error: builtins.IndexError: too many indices for array – Makaroniiii
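
The error is expected: the encoded array is two-dimensional, so a third index has nothing to address. For many categorical features, one way to find the first column of every one-hot group is to take the cumulative sum of the number of levels per feature. The sketch below assumes the encoded columns are laid out group by group, in the order of the categorical features, with the non-categorical columns at the end (as in the output shown earlier). `first_column_offsets` is a hypothetical helper, not part of NumPy or scikit-learn.

```python
import numpy as np

def first_column_offsets(X, categorical_indices):
    """Hypothetical helper: index of the first one-hot column of each group."""
    # Number of distinct levels in each categorical column of the raw data.
    n_levels = [len(np.unique(X[:, i])) for i in categorical_indices]
    # Group k starts where the previous groups end: 0, n0, n0 + n1, ...
    return np.concatenate(([0], np.cumsum(n_levels)[:-1])).tolist()

raw = np.array([[0, 1, 100], [1, 2, 200], [2, 3, 400]])
encoded = np.array([[1., 0., 0., 1., 0., 0., 100.],
                    [0., 1., 0., 0., 1., 0., 200.],
                    [0., 0., 1., 0., 0., 1., 400.]])

offsets = first_column_offsets(raw, [0, 1])
# Remove the first column of every group in one call.
reduced = np.delete(encoded, offsets, axis=1)
print(offsets, reduced.shape)
```

The helper works the same way for 1000 categorical features as for two, since only the list of categorical column indices changes.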

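To automate this, we build the list of indices to drop before applying one-hot encoding, by identifying the most common level of each categorical feature. This is because the most common level serves best as the base level, allowing the importance of the other levels to be evaluated.

After applying one-hot encoding, we build the list of indices to keep and use it to drop the columns identified earlier.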

from sklearn.preprocessing import OneHotEncoder as OHE
import numpy as np
import pandas as pd

a = np.array([[0,1,100],[1,2,200],[2,3,400]])

def get_indices_to_drop(X_before_OH, categorical_indices_list):
    # Returns the list of column indices to drop after one-hot encoding.
    # Drops the most common level of each categorical variable,
    # because the most common level serves best as the base level,
    # allowing the importance of the other levels to be evaluated.
    indices_to_drop = []
    indices_accum = 0
    for i in categorical_indices_list:
        column = X_before_OH[:, i]
        levels = np.unique(column)  # sorted levels, in encoded column order
        most_common = pd.Series(column).value_counts().index[0]
        # Position of the most common level within this feature's block
        indices_to_drop.append(indices_accum + int(np.searchsorted(levels, most_common)))
        # Each feature contributes one encoded column per level
        indices_accum += len(levels)
    return indices_to_drop

indices_to_drop = get_indices_to_drop(a, [0, 1])

# Note: categorical_features was removed in scikit-learn 0.22; this call needs an older release.
oh = OHE(categorical_features=[0,1])
a = oh.fit_transform(a).toarray()

def get_indices_to_keep(X_after_OH, index_to_drop_list):
    return [i for i in range(X_after_OH.shape[-1]) if i not in index_to_drop_list]

indices_to_keep = get_indices_to_keep(a, indices_to_drop)
a = a[:, indices_to_keep]
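
On newer scikit-learn releases (0.21 and later), `OneHotEncoder` accepts a `drop` parameter, so the manual index bookkeeping may not be needed. The following is a minimal sketch, not the answer's method: it uses `drop="first"`, which drops the first (lowest-sorted) level of each feature rather than the most common one, and it assumes columns 0 and 1 are the categorical ones as in the question. The names `ct` and `encoded` are introduced here for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

a = np.array([[0, 1, 100], [1, 2, 200], [2, 3, 400]])

# Encode columns 0 and 1, dropping the first level of each feature,
# and pass the remaining numeric column through unchanged.
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(drop="first"), [0, 1])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
encoded = ct.fit_transform(a)
print(encoded)
```

The encoded columns come first and the passed-through column last, so the layout matches the arrays shown in the earlier answer.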

Source

2017-09-18 01:11:23 tnbalankura
