2016-06-07 147 views
2

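Avoid scaling binary columns in scikit-learn StandardScaler

I am building a linear regression model in scikit-learn and scaling the inputs as a preprocessing step in a scikit-learn Pipeline. Is there any way to avoid scaling the binary columns? What happens is that these columns are scaled along with all the others, so their values end up centered around 0 instead of being 0 or 1. I get values like [-0.6, 0.3], which means that an input value of 0 now influences the predictions of my linear model.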

Basic code to illustrate:

>>> import numpy as np 
>>> from sklearn.pipeline import Pipeline 
>>> from sklearn.preprocessing import StandardScaler 
>>> from sklearn.linear_model import Ridge 
>>> X = np.hstack((np.random.random((1000, 2)), 
       np.random.randint(2, size=(1000, 2)))) 
>>> X 
array([[ 0.30314072, 0.22981496, 1.  , 1.  ], 
     [ 0.08373292, 0.66170678, 1.  , 0.  ], 
     [ 0.76279599, 0.36658793, 1.  , 0.  ], 
     ..., 
     [ 0.81517519, 0.40227095, 0.  , 0.  ], 
     [ 0.21244587, 0.34141014, 0.  , 0.  ], 
     [ 0.2328417 , 0.14119217, 0.  , 0.  ]]) 
>>> scaler = StandardScaler() 
>>> scaler.fit_transform(X) 
array([[-0.67768374, -0.95108883, 1.00803226, 1.03667198], 
     [-1.43378124, 0.53576375, 1.00803226, -0.96462528], 
     [ 0.90632643, -0.48022732, 1.00803226, -0.96462528], 
     ..., 
     [ 1.08682952, -0.35738315, -0.99203175, -0.96462528], 
     [-0.99022572, -0.56690563, -0.99203175, -0.96462528], 
     [-0.91994001, -1.25618613, -0.99203175, -0.96462528]]) 

I would love the output of the last line to be:

>>> scaler.fit_transform(X, dont_scale_binary_or_something=True) 
array([[-0.67768374, -0.95108883, 1.  , 1.  ], 
     [-1.43378124, 0.53576375, 1.  , 0.  ], 
     [ 0.90632643, -0.48022732, 1.  , 0.  ], 
     ..., 
     [ 1.08682952, -0.35738315, 0.  , 0.  ], 
     [-0.99022572, -0.56690563, 0.  , 0.  ], 
     [-0.91994001, -1.25618613, 0.  , 0.  ]]) 

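Is there any way to accomplish this? I suppose I could select the columns that are not binary, transform only those, and then put the transformed values back into the array, but I would like it to work nicely with the scikit-learn Pipeline workflow, so that I could do something like: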

clf = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge())]) 
clf.set_params(scaler__dont_scale_binary_features=True, ridge__alpha=0.04).fit(X, y) 
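
For reference, the manual approach described above (scale only the non-binary columns and write them back) can be sketched as follows. This is a minimal illustration rather than a Pipeline-ready solution, and it assumes that a column counts as binary when it contains nothing but 0s and 1s; the `is_binary` mask is a name made up for this sketch.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.hstack((rng.random((1000, 2)), rng.integers(2, size=(1000, 2))))

# Assumption: a column is "binary" when every value is either 0 or 1.
is_binary = np.array([np.isin(col, (0, 1)).all() for col in X.T])

# Scale only the non-binary columns and write them back into a copy of X.
X_scaled = X.copy()
X_scaled[:, ~is_binary] = StandardScaler().fit_transform(X[:, ~is_binary])
```

The binary columns of `X_scaled` are left exactly as they were in `X`, but the detection and write-back happen outside any Pipeline, which is the limitation the question is asking about.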

Answers

3

You should create a custom scaler that ignores the last two columns while scaling.

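from sklearn.base import TransformerMixin
from sklearn.preprocessing import StandardScaler
import numpy as np

class CustomScaler(TransformerMixin):
    def __init__(self):
        self.scaler = StandardScaler()

    def fit(self, X, y=None):
        # fit the scaler on every column except the last two (binary) ones
        self.scaler.fit(X[:, :-2], y)
        return self

    def transform(self, X):
        X_head = self.scaler.transform(X[:, :-2])
        # np.concatenate expects the arrays as a single sequence argument
        return np.concatenate((X_head, X[:, -2:]), axis=1)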
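
As an aside, scikit-learn 0.20 and later include `ColumnTransformer`, which can express the same idea without a custom class. The following is a rough sketch, assuming the continuous features are columns 0 and 1 as in the question's example; the target `y` is random placeholder data, used only so the pipeline can be fitted.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.hstack((rng.random((1000, 2)), rng.integers(2, size=(1000, 2))))
y = rng.random(1000)  # placeholder target, for illustration only

# Scale columns 0 and 1; pass the remaining (binary) columns through unchanged.
scale_continuous = ColumnTransformer(
    [("scale", StandardScaler(), [0, 1])],
    remainder="passthrough",
)

clf = Pipeline([("scaler", scale_continuous), ("ridge", Ridge())])
clf.set_params(ridge__alpha=0.04).fit(X, y)

# Inspect what the ridge step actually receives.
Xt = clf.named_steps["scaler"].transform(X)
```

Note that `ColumnTransformer` places the transformed columns first and the passed-through columns after them, so the original column order is only preserved here because the continuous columns already come first.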
3

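I am posting the code I adapted from @miindlek's response in case it is helpful to others. I ran into an error when I did not include BaseEstimator. Thanks again, @miindlek. Below, bin_vars_index is an array of column indexes for the binary variables, and cont_vars_index is the same for the continuous variables that you want to scale.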

from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class CustomScaler(BaseEstimator, TransformerMixin):
    # note: returns the feature matrix with the binary columns ordered first
    def __init__(self, bin_vars_index, cont_vars_index, copy=True, with_mean=True, with_std=True):
        # keep every constructor argument as an attribute so get_params() works
        self.bin_vars_index = bin_vars_index
        self.cont_vars_index = cont_vars_index
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std
        # StandardScaler's options are keyword-only in recent scikit-learn releases
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)

    def fit(self, X, y=None):
        self.scaler.fit(X[:, self.cont_vars_index], y)
        return self

    def transform(self, X, y=None, copy=None):
        # StandardScaler.transform takes only X and copy, so y is not forwarded
        X_tail = self.scaler.transform(X[:, self.cont_vars_index], copy=copy)
        return np.concatenate((X[:, self.bin_vars_index], X_tail), axis=1)
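
The error mentioned above is not quoted, so the following is only a plausible explanation: `Pipeline.set_params` and cloning rely on `get_params`/`set_params`, which come from `BaseEstimator` and not from `TransformerMixin`. A small sketch with two hypothetical classes (`WithoutBase` and `WithBase` are names invented for this illustration) shows the difference.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class WithoutBase(TransformerMixin):
    """Transformer that only inherits the mixin (hypothetical example)."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


class WithBase(BaseEstimator, TransformerMixin):
    """Same transformer, but with BaseEstimator added (hypothetical example)."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


# The mixin alone does not provide the parameter API that Pipeline relies on.
has_params_api = hasattr(WithoutBase(), "get_params")

# With BaseEstimator, constructor arguments stored as attributes are exposed.
params = WithBase(columns=[0, 1]).get_params()
```

This is also why it helps to store each `__init__` argument under an attribute of the same name: `get_params` looks the values up by those names.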
2

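I have modified @J_C's code to work with pandas DataFrames. You can pass the names of the columns you want to scale and get the result back in the initial column order.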

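from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        # keep every constructor argument as an attribute so get_params() works
        self.columns = columns
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std
        # StandardScaler's options are keyword-only in recent scikit-learn releases
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)

    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        return self

    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        # note: X_scaled gets a fresh default index here, so the concat below
        # only lines up row by row when X itself has a default RangeIndex
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns)
        # .ix was removed from pandas; .loc with a boolean mask selects the same columns
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]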

Usage:

scale = CustomScaler(columns=['duration', 'num_operations']) 
scaled = scale.fit_transform(churn_d) 
1

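I found that the concatenation in @Vitaliy Grabovets's DataFrame version does not work correctly unless you specify the index of X_scaled. The relevant line therefore now reads: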

X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns, index=X.index)
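
To see why the index matters, here is a small sketch with made-up data (the frames `left`, `right_default` and `right_indexed` are illustrative only). `pd.concat` with `axis=1` aligns rows by index label, so a frame built without an index gets a default 0..n-1 index and will not line up with a frame whose index is something else.

```python
import pandas as pd

# A frame whose index is not the default 0..n-1, e.g. after filtering or a split.
left = pd.DataFrame({"flag": [1, 0, 1]}, index=[10, 11, 12])

# Built without an index: pandas assigns a default RangeIndex (0, 1, 2).
right_default = pd.DataFrame({"x": [0.1, 0.2, 0.3]})
misaligned = pd.concat([left, right_default], axis=1)

# Built with the original index: rows line up as intended.
right_indexed = pd.DataFrame({"x": [0.1, 0.2, 0.3]}, index=left.index)
aligned = pd.concat([left, right_indexed], axis=1)

print(misaligned.shape, aligned.shape)
```

In the misaligned case the result has rows for both sets of labels, padded with NaN, which is the symptom that passing `index=X.index` avoids.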