用python删除曲线下方的数据点

我需要比较一些理论数据与python中的实际数据。理论数据来自解析方程。为了提高对比度，我想删除远离理论曲线的数据点。我的意思是，我想删除图中红色虚线下方的点（用matplotlib制作）。 Data points and theoretical curves 用python删除曲线下方的数据点

理论曲线和数据点都是不同长度的数组。

我可以尝试以粗眼的方式取出点，例如：第一上点可以使用检测：

data2[(data2.redshift<0.4)&data2.dmodulus>1] 
rec.array([('1997o', 0.374, 1.0203223485103787, 0.44354759972859786)], dtype=[('SN_name', '|S10'), ('redshift', '<f8'), ('dmodulus', '<f8'), ('dmodulus_error', '<f8')])

但我想用一个不太粗眼的方式。

所以，任何人都可以帮助我找到一个简单的方法来消除有问题的点？

谢谢！

来源

2011-10-31 Illa Rivero Losada

+18

纯粹从科学的角度来看，我不会删除这些观点，除非有一个极端有效的理由，认为他们错了。你有足够的数据，离群点不会对拟合产生任何影响，所以删除它们只会使图形看起来很漂亮，而不会达到任何科学目的。 – NickLH

你说得对，但我被告知。 –

只要看看红色曲线和点之间的差异，如果它大于红色曲线和红色曲线之间的差异，就将其删除。

diff=np.abs(points-red_curve) 
index= (diff>(dashed_curve-redcurve)) 
filtered=points[index]

但请从NickLH严肃的评论。你的数据看起来不错，没有任何过滤，你的“outlieres”都有一个非常大的错误，不会影响合适。

来源

2011-10-31 21:15:24 tillsten

这是一个很好的IDE，但是我发现很难计算红色曲线和点之间的差异，因为两个数组的长度都不相同。可以使用插值来创建点数组长度的红色曲线数组。 –

red_curves可能是用一个函数做出来的，只是它的相关x值。 – tillsten

这可能是矫枉过正，并基于您的评论

无论是理论曲线和数据点不同长度的阵列。

我将执行以下操作：

截断数据设定为使得其x值处于最大和理论的一组的分钟值范围内。
使用scipy.interpolate.interp1d和上述截断的数据x值插值理论曲线。步骤（1）的理由是为了满足interp1d的限制。
使用numpy.where来查找超出可接受理论值范围的数据y值。
如评论和其他答案中所建议的，不要放弃这些值。如果你想要清晰起见，通过绘制'内联'一种颜色和'异常值'一种颜色来指出它们。

这是一个脚本，接近你在找什么，我想。它希望能帮助你实现你想要什么：

import numpy as np 
import scipy.interpolate as interpolate 
import matplotlib.pyplot as plt 

# make up data 
def makeUpData(): 
    '''Make many more data points (x,y,yerr) than theory (x,y), 
    with theory yerr corresponding to a constant "sigma" in y, 
    about x,y value''' 
    NX= 150 
    dataX = (np.random.rand(NX)*1.1)**2 
    dataY = (1.5*dataX+np.random.rand(NX)**2)*dataX 
    dataErr = np.random.rand(NX)*dataX*1.3 
    theoryX = np.arange(0,1,0.1) 
    theoryY = theoryX*theoryX*1.5 
    theoryErr = 0.5 
    return dataX,dataY,dataErr,theoryX,theoryY,theoryErr 

def makeSameXrange(theoryX,dataX,dataY): 
    ''' 
    Truncate the dataX and dataY ranges so that dataX min and max are with in 
    the max and min of theoryX. 
    ''' 
    minT,maxT = theoryX.min(),theoryX.max() 
    goodIdxMax = np.where(dataX<maxT) 
    goodIdxMin = np.where(dataX[goodIdxMax]>minT) 
    return (dataX[goodIdxMax])[goodIdxMin],(dataY[goodIdxMax])[goodIdxMin] 

# take 'theory' and get values at every 'data' x point 
def theoryYatDataX(theoryX,theoryY,dataX): 
    '''For every dataX point, find interpolated thoeryY value. theoryx needed 
    for interpolation.''' 
    f = interpolate.interp1d(theoryX,theoryY) 
    return f(dataX[np.where(dataX<np.max(theoryX))]) 

# collect valid points 
def findInlierSet(dataX,dataY,interpTheoryY,thoeryErr): 
    '''Find where theoryY-theoryErr < dataY theoryY+theoryErr and return 
    valid indicies.''' 
    withinUpper = np.where(dataY<(interpTheoryY+theoryErr)) 
    withinLower = np.where(dataY[withinUpper] 
        >(interpTheoryY[withinUpper]-theoryErr)) 
    return (dataX[withinUpper])[withinLower],(dataY[withinUpper])[withinLower] 

def findOutlierSet(dataX,dataY,interpTheoryY,thoeryErr): 
    '''Find where theoryY-theoryErr < dataY theoryY+theoryErr and return 
    valid indicies.''' 
    withinUpper = np.where(dataY>(interpTheoryY+theoryErr)) 
    withinLower = np.where(dataY<(interpTheoryY-theoryErr)) 
    return (dataX[withinUpper],dataY[withinUpper], 
      dataX[withinLower],dataY[withinLower]) 
if __name__ == "__main__": 

    dataX,dataY,dataErr,theoryX,theoryY,theoryErr = makeUpData() 

    TruncDataX,TruncDataY = makeSameXrange(theoryX,dataX,dataY) 

    interpTheoryY = theoryYatDataX(theoryX,theoryY,TruncDataX) 

    inDataX,inDataY = findInlierSet(TruncDataX,TruncDataY,interpTheoryY, 
            theoryErr) 

    outUpX,outUpY,outDownX,outDownY = findOutlierSet(TruncDataX, 
                TruncDataY, 
                interpTheoryY, 
                theoryErr) 
    #print inlierIndex 
    fig = plt.figure() 
    ax = fig.add_subplot(211) 

    ax.errorbar(dataX,dataY,dataErr,fmt='.',color='k') 
    ax.plot(theoryX,theoryY,'r-') 
    ax.plot(theoryX,theoryY+theoryErr,'r--') 
    ax.plot(theoryX,theoryY-theoryErr,'r--') 
    ax.set_xlim(0,1.4) 
    ax.set_ylim(-.5,3) 
    ax = fig.add_subplot(212) 

    ax.plot(inDataX,inDataY,'ko') 
    ax.plot(outUpX,outUpY,'bo') 
    ax.plot(outDownX,outDownY,'ro') 
    ax.plot(theoryX,theoryY,'r-') 
    ax.plot(theoryX,theoryY+theoryErr,'r--') 
    ax.plot(theoryX,theoryY-theoryErr,'r--') 
    ax.set_xlim(0,1.4) 
    ax.set_ylim(-.5,3) 
    fig.savefig('findInliers.png')

这个数字是结果： enter image description here

来源

2011-11-01 13:03:06 Yann

要么你可以使用numpy.where（），以确定其对XY满足您的绘图标准，或也许列举几乎相同的事情。例如：

x_list = [ 1, 2, 3, 4, 5, 6 ] 
y_list = ['f','o','o','b','a','r'] 

result = [y_list[i] for i, x in enumerate(x_list) if 2 <= x < 5] 

print result

我相信你可以改变条件，使“2”和“5”在上面的例子中是你的曲线的功能

来源

2011-11-02 15:08:23 Dan

最后我用一些晏的代码：

def theoryYatDataX(theoryX,theoryY,dataX): 
'''For every dataX point, find interpolated theoryY value. theoryx needed 
for interpolation.''' 
f = interpolate.interp1d(theoryX,theoryY) 
return f(dataX[np.where(dataX<np.max(theoryX))]) 

def findOutlierSet(data,interpTheoryY,theoryErr): 
    '''Find where theoryY-theoryErr < dataY theoryY+theoryErr and return 
    valid indicies.''' 

    up = np.where(data.dmodulus > (interpTheoryY+theoryErr)) 
    low = np.where(data.dmodulus < (interpTheoryY-theoryErr)) 
    # join all the index together in a flat array 
    out = np.hstack([up,low]).ravel() 

    index = np.array(np.ones(len(data),dtype=bool)) 
    index[out]=False 

    datain = data[index] 
    dataout = data[out] 

    return datain, dataout 

def selectdata(data,theoryX,theoryY): 
    """ 
    Data selection: z<1 and +-0.5 LFLRW separation 
    """ 
    # Select data with redshift z<1 
    data1 = data[data.redshift < 1] 

    # From modulus to light distance: 
    data1.dmodulus, data1.dmodulus_error = modulus2distance(data1.dmodulus,data1.dmodulus_error) 

    # redshift data order 
    data1.sort(order='redshift') 

    # Outliers: distance to LFLRW curve bigger than +-0.5 
    theoryErr = 0.5 
    # Theory curve Interpolation to get the same points as data 
    interpy = theoryYatDataX(theoryX,theoryY,data1.redshift) 

    datain, dataout = findOutlierSet(data1,interpy,theoryErr) 
    return datain, dataout

使用这些功能，我终于可以得到：

Data selection

谢谢大家的帮助。

来源

2011-11-08 14:52:58

+1为了向我们展示您的解决方案，并保持图表上的外围点。 – NickLH

用python删除曲线下方的数据点

回答

相关问题