Kernel density estimation in Julia

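I am trying to implement kernel density estimation. However, my code does not produce the answer it should. It is written in Julia, but the code should be self-explanatory.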

Here is the algorithm:

$$ f(x) = \frac{1}{n h} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) $$

where

$$ K(u) = 0.5 \cdot I(|u| \le 1) \quad \text{with} \quad u = \frac{x - X_i}{h} $$

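So the algorithm tests whether the distance between x and an observation X_i, scaled by a constant factor (the bin width h), is at most one. If so, it assigns 0.5/(n*h) to that value, where n is the number of observations.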
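As a minimal sketch of the two formulas above, the estimate at a single point could be written as follows. The names `uniform_kernel` and `kde_at`, and the three sample observations, are illustrative assumptions and not part of the original post.

```julia
# Uniform kernel: K(u) = 0.5 if |u| <= 1, otherwise 0.
uniform_kernel(u) = abs(u) <= 1 ? 0.5 : 0.0

# Density estimate at a single point x for observations X and bin width h.
function kde_at(x, X, h)
    n = length(X)
    return sum(uniform_kernel((x - Xi) / h) for Xi in X) / (n * h)
end

X = [0.20, 0.25, 0.90]   # hypothetical observations
h = 0.1

# Two of the three observations lie within h of x = 0.22,
# so the estimate is 2 * 0.5 / (3 * 0.1).
println(kde_at(0.22, X, h))
```

Points that have no observation within h, such as x = 0.6 for this sample, get an estimate of zero.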

Here is my implementation:

# Kernel density function.
# Purpose: estimate the probability density function (pdf)
# of given observations
# @param data: observations for which the pdf should be estimated
# @return: returns an array with the estimated densities

function kernelDensity(data)

    # Uniform kernel function.
    # @param x: current x value
    # @param observation: x value of observation i
    # @param width: bin width
    # @return: returns 1 if the absolute distance from
    # x (current) to x (observation), scaled by the bin width,
    # is at most 1. Otherwise it returns 0.

    function uniformKernel(x, observation, width)
        u = (x - observation) / width
        abs(u) <= 1 ? 1 : 0
    end

    # number of observations in the data set
    n = length(data)

    # bin width (set arbitrarily to 0.1)
    h = 0.1

    # vector that stores the pdf
    res = zeros(Real, n)

    # counter variable for the loop
    counter = 0

    # lower and upper limit of the x axis
    start = floor(minimum(data))
    stop = ceil(maximum(data))

    # main loop
    # range (called linspace before Julia 0.7): divides the space
    # from start to stop into n equally spaced points
    for x in range(start, stop=stop, length=n)
        counter += 1
        for observation in data

            # count all observations for which the kernel
            # returns 1 and multiply by 0.5 because the
            # kernel computes the absolute difference, which can be
            # either positive or negative
            res[counter] += 0.5 * uniformKernel(x, observation, h)
        end
        # divide by n times h
        res[counter] /= n * h
    end
    # return results
    res
end
# run function
# rand: generates 10 uniform random numbers between 0 and 1
kernelDensity(rand(10))

This is returned:

> 0.0 
> 1.5 
> 2.5 
> 1.0 
> 1.5 
> 1.0 
> 0.0 
> 0.5 
> 0.5 
> 0.0

The sum is 8.5 (the cumulative distribution function should be 1).

So there are two errors:

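The values are not scaled properly. Each number should be about one tenth of its current value. In fact, if the number of observations increases by a factor of 10^n, n = 1, 2, ..., then the CDF also increases by a factor of 10^n. For example: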

> kernelDensity(rand(1000)) 
> 953.53

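The values do not sum to 10 (or to 1, if it were not for the scaling error). The error becomes more evident as the sample size increases: roughly 5% of the observations are not included.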

I believe I implemented the formula 1:1, so I really do not understand where the error is.


2015-09-04 Vincent

I'm no expert on KDEs, so take all of this with a grain of salt, but a very similar (and much faster!) implementation of your code would be:

function kernelDensity(data::Vector{T}, h::T) where {T<:AbstractFloat}
    n = length(data)
    res = zeros(T, n)  # zero-initialized accumulator (similar(data) is uninitialized)
    lb = minimum(data); ub = maximum(data)
    for (i, x) in enumerate(range(lb, stop=ub, length=n))
        for obs in data
            res[i] += abs((obs - x) / h) <= 1 ? 0.5 : 0.0
        end
        res[i] /= n * h
    end
    sum(res)
end

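If I'm not mistaken, the density estimate should integrate to 1, that is, we would expect kernelDensity(rand(100), 0.1)/100 to be at least close to 1. With the implementation above I get there, give or take 5%, but then again we don't know that 0.1 is the optimal bandwidth (using h=0.135 instead I get to within 0.1%), and the uniform kernel is known to be only about 93% "efficient".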
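One way to check the "integrates to 1" claim numerically is a Riemann sum, in which the density values are multiplied by the grid spacing before summing. Substituting u = (x - X_i)/h shows that each kernel term contributes h times the integral of K, which is h, so the exact integral of the estimate is 1. The sketch below is only an illustration under stated assumptions and not the code from this answer: the names `kde_grid` and `riemann_integral`, the grid size `m`, and the option to extend the grid by `padding` beyond the data range are all made up for this example.

```julia
# Uniform kernel: K(u) = 0.5 if |u| <= 1, otherwise 0.
uniform_kernel(u) = abs(u) <= 1 ? 0.5 : 0.0

# Evaluate the density estimate on every point of `grid`.
function kde_grid(data, h, grid)
    n = length(data)
    return [sum(uniform_kernel((x - Xi) / h) for Xi in data) / (n * h) for x in grid]
end

# Approximate the integral of the estimate by a Riemann sum:
# the sum of the density values times the grid spacing.
# `padding` (hypothetical parameter) widens the grid on both sides.
function riemann_integral(data, h; padding = h, m = 10_001)
    lo = minimum(data) - padding
    hi = maximum(data) + padding
    grid = range(lo, stop = hi, length = m)
    return sum(kde_grid(data, h, grid)) * step(grid)
end

data = rand(100)   # 100 uniform random numbers in [0, 1), as in the post
h = 0.1

println(riemann_integral(data, h))                 # grid extended by h on both sides
println(riemann_integral(data, h; padding = 0.0))  # grid restricted to the data range
```

With the padded grid the result should come out close to 1. On the restricted grid it is somewhat lower, because the part of each kernel that extends past the smallest and largest observations is cut off, which would be consistent with the shortfall of a few percent described in the question.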

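In any case, there is a very good kernel density package available for Julia, KernelDensity, so you should probably just do Pkg.add("KernelDensity") instead of trying to code your own Epanechnikov kernel :)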


2015-09-04 11:04:11

Thank you for the code and the library. I had not found it. – Vincent