Identifying "chunks" of outliers in 1D and 2D data in Python

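Data: I have a column of data, d, that varies as a function of two other variables, a and b, which are defined in two other columns. My goal is to identify chunks of outliers in d. These may not be outliers in the strict sense, but for my purposes I want to pick out the data that do not belong to the cloud of data that can be fitted linearly.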

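Question: I have never done a cluster analysis before, but the name sounds like it could do what I want. If I do go with cluster analysis, I would like to do it for the following two cases: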

1. with a and d
2. with a, b and d

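I did some searching and found that for #1 the KernelDensity module would be more appropriate, while for #2 the MeanShift module is a good choice. Both are available in Python.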

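Problem: I have never done a cluster analysis before, so I cannot follow the examples given in the documentation for KernelDensity and MeanShift. Could someone explain how to use KernelDensity and MeanShift to identify the "chunks" of outliers in d for case 1 and case 2?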

Source

2015-07-09 Pupil

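I think you first need a robust regression, because your data are already contaminated by outliers. Once the robust regression is fitted, the squared error at each point can be used as a distance measure from the cluster centre (the regression line). Observations with a large squared error are likely to be outliers.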

Reference link for robust regression in sklearn: http://scikit-learn.org/stable/modules/linear_model.html#robustness-regression-outliers-and-modeling-errors

@JanxunLi: I am sorry, but I could not follow the example given in that reference. Could you give a simple example? – Pupil

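First, KernelDensity is meant for non-parametric approaches. Since you firmly believe the relationship is linear (that is, a parametric model), KernelDensity is not the most suitable choice for this task.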
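For comparison, this is roughly what the density route would look like for case 1. It is only a sketch: the synthetic data, the bandwidth of 0.5 and the cut-off of 50 points are all assumptions made for illustration. Note that it flags points in sparse regions regardless of the linear trend, so extreme points that sit right on the line can be flagged as well.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Synthetic data (an assumption): a linear cloud with the last 100 points shifted upwards.
rng = np.random.RandomState(0)
a = rng.randn(1000)
b = rng.randn(1000)
d = 2 * a - b + rng.randn(1000)
d[-100:] += 10 * np.abs(rng.randn(100))

# Case 1: estimate the joint density of (a, d).
X = np.column_stack([a, d])
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)  # bandwidth is a guess
log_density = kde.score_samples(X)  # log of the estimated density at each point

# Flag the 50 points lying in the sparsest regions (50 is arbitrary).
n_flagged = 50
sparse_idx = np.argsort(log_density)[:n_flagged]
print("share of flagged points that belong to the shifted block:",
      np.mean(sparse_idx >= 900))
```

In practice the bandwidth would need tuning, and the result depends only on how crowded a region is, not on how well a point fits the line.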

Here is sample code that identifies the outliers with a robust regression.

import matplotlib.pyplot as plt 
import numpy as np 
from sklearn.linear_model import RANSACRegressor 


# data: 1000 obs, 100 of them are outliers 
# ===================================================== 
np.random.seed(0) 
a = np.random.randn(1000) 
b = np.random.randn(1000) 
d = 2 * a - b + np.random.randn(1000) 
# the last 100 are outliers 
d[-100:] = d[-100:] + 10 * np.abs(np.random.randn(100)) 

fig, axes = plt.subplots(ncols=2, sharey=True) 
axes[0].scatter(a, d, c='g') 
axes[0].set_xlabel('a') 
axes[0].set_ylabel('d') 
axes[1].scatter(b, d, c='g') 
axes[1].set_xlabel('b')


# processing 
# ===================================================== 
# robust regression 
robust_estimator = RANSACRegressor(random_state=0) 
robust_estimator.fit(np.vstack([a,b]).T, d) 
d_pred = robust_estimator.predict(np.vstack([a,b]).T) 

# squared residual of each observation under the robust fit
mse = (d - d_pred.ravel()) ** 2

# flag the 50 largest squared residuals; 50 is an arbitrary choice and does not assume we already know there are 100 outliers
index = np.argsort(mse)
fig, axes = plt.subplots(ncols=2, sharey=True) 
axes[0].scatter(a[index[:-50]], d[index[:-50]], c='b', label='inliers') 
axes[0].scatter(a[index[-50:]], d[index[-50:]], c='r', label='outliers') 
axes[0].set_xlabel('a') 
axes[0].set_ylabel('d') 
axes[0].legend(loc='best') 
axes[1].scatter(b[index[:-50]], d[index[:-50]], c='b', label='inliers') 
axes[1].scatter(b[index[-50:]], d[index[-50:]], c='r', label='outliers') 
axes[1].legend(loc='best') 
axes[1].set_xlabel('b')
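The cut-off of 50 above is arbitrary. One possible data-driven alternative, which is not part of the code above, is to scale the residuals by their median absolute deviation (MAD) and flag anything beyond a few robust standard deviations. The sketch below assumes the same kind of synthetic data; the multiplier 3 is a common convention, not a rule.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

# Synthetic data (an assumption): the last 100 observations are shifted upwards.
rng = np.random.RandomState(0)
a = rng.randn(1000)
b = rng.randn(1000)
d = 2 * a - b + rng.randn(1000)
d[-100:] += 10 * np.abs(rng.randn(100))

X = np.column_stack([a, b])
robust_estimator = RANSACRegressor(random_state=0).fit(X, d)
residuals = d - robust_estimator.predict(X).ravel()

# Robust scale: MAD, rescaled by 1.4826 so that it matches the standard
# deviation when the residuals are normally distributed.
center = np.median(residuals)
mad_scale = 1.4826 * np.median(np.abs(residuals - center))

# Flag points more than 3 robust standard deviations from the centre.
is_outlier = np.abs(residuals - center) > 3 * mad_scale
print("flagged in total:", is_outlier.sum())
print("flagged among the 100 shifted points:", is_outlier[-100:].sum())
```

With this rule the number of flagged points follows from the data instead of being fixed in advance.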


For your sample data:

import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
from sklearn.linear_model import RANSACRegressor 

df = pd.read_excel('/home/Jian/Downloads/Data.xlsx').dropna() 

a = df.a.values.reshape(len(df), 1) 
d = df.d.values.reshape(len(df), 1) 

fig, axes = plt.subplots(ncols=2, sharey=True) 
axes[0].scatter(a, d, c='g') 
axes[0].set_xlabel('a') 
axes[0].set_ylabel('d') 

robust_estimator = RANSACRegressor(random_state=0) 
robust_estimator.fit(a, d) 
d_pred = robust_estimator.predict(a) 

# squared residual of each observation under the robust fit
mse = (d - d_pred) ** 2

index = np.argsort(mse.ravel()) 

axes[1].scatter(a[index[:-50]], d[index[:-50]], c='b', label='inliers', alpha=0.2) 
axes[1].scatter(a[index[-50:]], d[index[-50:]], c='r', label='outliers') 
axes[1].set_xlabel('a') 
axes[1].legend(loc=2)

Source

2015-07-09 22:48:41

@Pupil I have updated the code. Please take a look.

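From what I can understand of the code, you 1) start from the premise that you already know the outliers, and 2) fit the regression to all of the data, outliers included. Also, what parameters are you using in the robust regression? My goal, however, is to 1) first pick out these chunks of outliers, and 2) fit a linear regression only to the cloud of data lying below the outliers. – Pupil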

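@Pupil No, I do not assume any knowledge about the outliers. None of the code in the second half assumes it knows that the outliers are the last 100 observations. The code above demonstrates how to remove the outliers. If you like, you can re-run a linear regression on the remaining inliers. '.T' is just the transpose operator; it makes sure each column is a feature.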
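Re-running the regression on the inliers could look like the sketch below. It assumes the same kind of synthetic data as the answer and, instead of the fixed count of 50, uses the `inlier_mask_` attribute that RANSACRegressor sets after fitting. With default settings the inlier threshold is the median absolute deviation of d, which is a fairly strict cut.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

# Synthetic data (an assumption): true model d = 2a - b + noise,
# with the last 100 observations shifted upwards.
rng = np.random.RandomState(0)
a = rng.randn(1000)
b = rng.randn(1000)
d = 2 * a - b + rng.randn(1000)
d[-100:] += 10 * np.abs(rng.randn(100))

# One column per feature; identical to np.vstack([a, b]).T
X = np.column_stack([a, b])

ransac = RANSACRegressor(random_state=0).fit(X, d)
keep = ransac.inlier_mask_  # boolean mask, True for points RANSAC treats as inliers

# Ordinary least squares on everything versus on the inliers only.
ols_all = LinearRegression().fit(X, d)
ols_clean = LinearRegression().fit(X[keep], d[keep])

print("all data: coef", ols_all.coef_, "intercept", ols_all.intercept_)
print("inliers : coef", ols_clean.coef_, "intercept", ols_clean.intercept_)
```

The intercept fitted on all data is pulled upwards by the shifted points, while the fit on the inliers should land close to the line used to generate the data.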
