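sklearn Ridge and sample_weight gives MemoryError

I am trying to run a simple sklearn Ridge regression using an array of sample weights. X_train is a ~200k by 100 2D NumPy array. I get a memory error when I try to use the sample_weight option. Without that option it works just fine. For simplicity I reduced the features to 2, and sklearn still throws a memory error. Any ideas?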

from sklearn import linear_model

model = linear_model.Ridge()
model.fit(X_train, y_train, sample_weight=w_tr)

Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "/home/g/anaconda/lib/python2.7/site-packages/sklearn/linear_model/ridge.py", line 449, in fit 
    return super(Ridge, self).fit(X, y, sample_weight=sample_weight) 
    File "/home/g/anaconda/lib/python2.7/site-packages/sklearn/linear_model/ridge.py", line 338, in fit 
    solver=self.solver) 
    File "/home/g/anaconda/lib/python2.7/site-packages/sklearn/linear_model/ridge.py", line 286, in ridge_regression 
    K = safe_sparse_dot(X, X.T, dense_output=True) 
    File "/home/g/anaconda/lib/python2.7/site-packages/sklearn/utils/extmath.py", line 83, in safe_sparse_dot 
    return np.dot(a, b) 
MemoryError

Source

2014-03-31 ADJ

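Setting sample weights can make a big difference in the way the sklearn linear_model Ridge object treats your data, especially if the matrix is tall (n_samples > n_features), as in your case. Without sample weights it exploits the fact that X.T.dot(X) is a relatively small matrix (100 x 100 in your case) and therefore inverts a matrix in feature space. With sample weights given, the Ridge object decides to stay in sample space (in order to be able to weight the samples individually; see the relevant lines here and here for the branch to _solve_dense_cholesky_kernel, which works in sample space) and therefore needs to invert a matrix of the same size as X.dot(X.T) (which in your case is n_samples x n_samples = 200000 x 200000 and causes a memory error before it is even created). This is actually an implementation issue; see the manual workaround below.

TL;DR: the Ridge object cannot handle sample weights in feature space and generates an n_samples x n_samples matrix, which causes your memory error.

While waiting for a possible remedy within scikit-learn, you can try solving the problem in feature space explicitly, like this: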
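To make the size difference concrete, here is a minimal sketch that computes the memory a dense float64 matrix of each shape would need, and then checks on a small synthetic problem that the feature-space and sample-space formulations give the same coefficients. The helper `gram_bytes` and the synthetic data are assumptions for illustration, and the sample-space formula shown is one algebraically equivalent form, not necessarily the exact computation scikit-learn performs internally.

```python
import numpy as np

def gram_bytes(n, itemsize=8):
    """Bytes needed for a dense n x n matrix of float64 values."""
    return n * n * itemsize

n_samples, n_features = 200_000, 100
print(gram_bytes(n_features))       # X.T.dot(X): 80000 bytes
print(gram_bytes(n_samples) / 1e9)  # X.dot(X.T): 320.0 GB

# Small synthetic problem (hypothetical data) to compare both formulations.
rng = np.random.RandomState(0)
X = rng.randn(50, 3)
y = rng.randn(50)
w = rng.uniform(0.5, 2.0, size=50)
alpha = 1.0

# Feature space: solve an (n_features x n_features) system.
coef_feature = np.linalg.solve(
    (X.T * w).dot(X) + alpha * np.eye(X.shape[1]),
    X.T.dot(w * y),
)

# Sample space: solve an (n_samples x n_samples) system instead.
coef_sample = X.T.dot(
    np.linalg.solve(X.dot(X.T) + alpha * np.diag(1.0 / w), y)
)

print(np.allclose(coef_feature, coef_sample))
```

Both routes reach the same solution; only the size of the linear system differs, which is why the feature-space route is preferable when n_samples is much larger than n_features.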

import numpy as np
alpha = 1.0  # You did not specify this in your Ridge object, but it is the default penalty for the Ridge object
sample_weights = w_tr.ravel()  # make sure this is 1D
target = y.ravel()  # make sure this is 1D as well
n_samples, n_features = X.shape
# Solve the weighted normal equations (X^T W X + alpha * I) coef = X^T W y in feature space
coef = np.linalg.solve((X.T * sample_weights).dot(X) + alpha * np.eye(n_features),
                       X.T.dot(sample_weights * target))

For a new sample X_new, your prediction is

prediction = np.dot(X_new, coef)

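To confirm that this approach is valid, you can compare this coef to model.coef_ (after you have fit the model) from your own code, applied to a smaller number of samples (e.g. 300) that does not cause a memory error when used with the Ridge object.

IMPORTANT: the code above only coincides with the sklearn implementation if your data are already centered, i.e. your data must have mean 0. Implementing a full ridge regression with intercept fitting here would amount to a contribution to scikit-learn, so it would be better to post it there. The way to center your data is as follows: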
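The check described above could look roughly like the following sketch. The arrays are synthetic stand-ins for a 300-row subset of the real data (an assumption), and `fit_intercept=False` is passed so that centering does not enter the comparison.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-ins for a 300-row subset of X_train, y_train and w_tr.
rng = np.random.RandomState(0)
X_small = rng.randn(300, 2)
y_small = rng.randn(300)
w_small = rng.uniform(0.5, 2.0, size=300)
alpha = 1.0

# Manual feature-space solution without an intercept.
coef = np.linalg.solve(
    (X_small.T * w_small).dot(X_small) + alpha * np.eye(X_small.shape[1]),
    X_small.T.dot(w_small * y_small),
)

# Reference fit; no intercept, so no centering is involved.
model = Ridge(alpha=alpha, fit_intercept=False)
model.fit(X_small, y_small, sample_weight=w_small)

print(np.allclose(coef, model.coef_))
```

If the two coefficient vectors agree on the small subset, the same manual formula can be applied to the full data set, where it only ever builds an n_features x n_features matrix.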

X_mean = X.mean(axis=0) 
target_mean = target.mean() # Assuming target is 1d as forced above

Then you apply the code provided above to

X_centered = X - X_mean 
target_centered = target - target_mean

For predictions on new data, you need

prediction = np.dot(X_new - X_mean, coef) + target_mean
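The centering, fitting and prediction steps can be bundled into a pair of small helpers, sketched below. The names `fit_weighted_ridge` and `predict` are hypothetical and not part of scikit-learn. One deliberate difference, stated as an assumption: this sketch centers with weighted means (`np.average` with `weights`), because for non-uniform weights that is what exactly minimizes the weighted ridge objective with an unpenalized intercept; with uniform weights it reduces to the plain means.

```python
import numpy as np

def fit_weighted_ridge(X, y, sample_weight, alpha=1.0):
    """Weighted ridge with an unpenalized intercept, solved in feature space."""
    w = np.asarray(sample_weight, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    X = np.asarray(X, dtype=float)

    # Weighted means of the features and of the target.
    X_mean = np.average(X, axis=0, weights=w)
    y_mean = np.average(y, weights=w)
    X_centered = X - X_mean
    y_centered = y - y_mean

    # (Xc^T W Xc + alpha * I) coef = Xc^T W yc, an n_features x n_features system.
    n_features = X.shape[1]
    coef = np.linalg.solve(
        (X_centered.T * w).dot(X_centered) + alpha * np.eye(n_features),
        X_centered.T.dot(w * y_centered),
    )
    intercept = y_mean - X_mean.dot(coef)
    return coef, intercept

def predict(X_new, coef, intercept):
    """Same as np.dot(X_new - X_mean, coef) + y_mean, written with an intercept."""
    return np.dot(X_new, coef) + intercept
```

Usage would be `coef, intercept = fit_weighted_ridge(X, y, w_tr, alpha=1.0)` followed by `predict(X_new, coef, intercept)`.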

EDIT: As of April 15th, 2014, scikit-learn ridge regression can deal with this problem (bleeding-edge code). It will be available in the 0.15 release.

Source

2014-04-02 11:56:21 eickenberg

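Thanks to @ogrisel for pointing out to me that sklearn linear models center the data – eickenberg

[This enhancement proposal](https://github.com/scikit-learn/scikit-learn/pull/3034) implements the functionality explained above. – eickenberg

The latest version of scikit-learn now supports sample weights in feature space. – eickenberg

What version of NumPy do you have installed?

It looks like the final method call is numpy.dot(X, X.T), which, with X.shape = (200000, 2) as in your case, would produce a 200k by 200k matrix.

Try converting your observations to a sparse matrix type, or reduce the number of observations used (maybe there is also a variant of ridge regression that processes a few observations at a time, in batches?).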
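One possible reading of the batch idea, sketched here as an assumption rather than as an existing scikit-learn feature: because X.T.dot(W).dot(X) and X.T.dot(W).dot(y) are sums over samples, they can be accumulated chunk by chunk, so only an n_features x n_features matrix is ever held in memory. The function name `ridge_in_batches` is hypothetical, and no intercept is fitted.

```python
import numpy as np

def ridge_in_batches(batches, n_features, alpha=1.0):
    """Weighted ridge coefficients from an iterable of (X, y, w) chunks."""
    gram = alpha * np.eye(n_features)  # accumulates X^T W X + alpha * I
    rhs = np.zeros(n_features)         # accumulates X^T W y
    for X_b, y_b, w_b in batches:
        gram += (X_b.T * w_b).dot(X_b)
        rhs += X_b.T.dot(w_b * y_b)
    return np.linalg.solve(gram, rhs)

# Hypothetical data, split into chunks of 100 rows.
rng = np.random.RandomState(0)
X = rng.randn(1000, 4)
y = rng.randn(1000)
w = rng.uniform(0.5, 2.0, size=1000)
chunks = ((X[i:i + 100], y[i:i + 100], w[i:i + 100]) for i in range(0, 1000, 100))
coef_batched = ridge_in_batches(chunks, n_features=4, alpha=1.0)
```

The result matches solving the full system at once, since the accumulated sums are identical up to floating-point rounding.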

Source

2014-04-01 13:09:06
