空间分布/模拟/密度函数

想象一下，你有相当大的数据集2.000.000点随机发布在一些多边形空间。密度函数必须从随机选择的4.000点样本中随时测量。这个过程必须重复200次。我的代码并没有很好地解决这个问题。任何建议如何改进代码。空间分布/模拟/密度函数

# coord is SpatialPoints Object 
library(sp) 
library(maptools) 
library(map)

你可以从下面的链接的对象polygonial：使用酷睿i3，2.27Ghz，4个处理器https://www.dropbox.com/sh/65c3rke0gi4d8pb/LAKJWhwm-l

germG <- readShapePoly("vg250_gem.shp") 
coord <- spsample(germG, 2e06, "random") # this command needs some minutes to be done. 

# R is the number of simulations 
R <- 200 
M <- matrix(NA,R, 256) 
ptm=proc.time() 
for(r in 1:R) { 
    ix <- sample(1:2e06,size=4000) 
    Dg <- spDists(coord[ix]) 
    Dg <- as.vector(Dg[Dg!=0]) 
    kg <- density(Dg,bw="nrd0",n=256) 
    M[r,] <- kg$y 
} 
runningtime = proc.time()-ptm 
cat("total run time (sec) =",round(runningtime[3],1),"\n")

上部代码提供了一个总的运行时间（秒）= 964.8和4 Gb RAM。如何加快这个for-loop仿真的性能？我会非常感谢你的评论，评论和建议。

来源

2014-01-17 And_R

为您的代码添加更多评论总能帮助其他人更好地遵循它。另外，最特别的是什么？你的目标是什么（例如特定的时间）？ – camdixon

@camdixon我已经通过编辑初始文本添加了更多信息。谢谢你的建议。 –

因此，坐标是200万点，但是只有在创建了R（200）个采样点并创建了M矩阵后才开始测量时间。我会在开始时测量时间，看看创建坐标需要多长时间，然后再次测量以查看采样需要多长时间。此外，基于你的代码，我不知道你在哪里设置$ y，你能否给我澄清这个评论？ – camdixon

不是一个真正的答案，只是一些意见：

如果R =＃迭代，并且每次迭代S =样本大小（例如，R = 200和S = 4000），那么你的运行时间将是〜O（R × S ）。因此，加倍运行和将样本大小减半会使运行时间减少2倍。
spDists(...)中的默认距离度量标准是欧几里得。如果这是您想要的，那么使用dist(..)函数会更好 - 它效率更高（请参阅下面的代码）。如果您需要地理距离（例如Great Circle），则必须使用spDists(..., longlat=T)。
spDists(...)计算全程矩阵，而不仅仅是斜下方。这意味着所有非零距离显示两次，这对密度有影响。这就是为什么M1和M2在下面的代码中是而不是等于。
对于大S，分析您的代码（使用Rprof）显示大约46％的时间花费在density(...)之间，另有38％花费在spDists(...)中。所以这是一个用调用apply，lapply等代替for循环的例子，对于这些并不会有多大帮助。
还有其他函数用于计算地理距离矩阵 - 假设这是您想要的，但没有一个比您已经使用的更快。我在fossil包中尝试了earth.dist(...)包，geosphere包中包含distm(...)包，而在fields包中尝试了rdist.earth(...)，但这些操作都没有改进。

代码：

library(sp) 
library(maptools) 
germG <- readShapePoly("vg250_gem.shp") 
coord <- spsample(germG, 1e4, "random") # Just 10,000 points... 
R <- 200 

# dist(...) and spDists(..., longlat=F) give same result 
zz <- coord[sample(1e4,size=200)] 
d1 <- spDists(zz) 
d2 <- dist([email protected]) 
max(abs(as.matrix(d1)-as.matrix(d2))) 
# [1] 0 
# but dist(...) is much faster 
M1 <- matrix(NA,R, 256) 
set.seed(1) 
system.time({ 
    for(r in 1:R) { 
    ix <- sample(1e4,size=200) # S = 200; test case 
    Dg <- spDists(coord[ix])  # using spDists(...) 
    Dg <- as.vector(Dg[Dg!=0]) 
    kg <- density(Dg,bw="nrd0",n=256) 
    M1[r,] <- kg$y 
    } 
}) 
# user system elapsed 
# 11.08 0.17 11.28 

M2 <- matrix(NA,R, 256) 
set.seed(1) 
system.time({ 
    for(r in 1:R) { 
    ix <- sample(1e4,size=200) # S = 200; test case 
    Dg <- dist(coord[ix]@coords) # using dist(...) 
    Dg <- as.vector(Dg[Dg!=0]) 
    kg <- density(Dg,bw="nrd0",n=256) 
    M2[r,] <- kg$y 
    } 
}) 
# user system elapsed 
# 2.67 0.03 2.73

编辑针对OP的请求。以下是大小= 200的分析代码。

R=200 
M <- matrix(NA,R, 256) 
Rprof("profile") 
set.seed(1) 
system.time({ 
    for(r in 1:R) { 
    ix <- sample(1e4,size=200) # S = 200; test case 
    Dg <- spDists(coord[ix])  # using spDists(...) 
    Dg <- as.vector(Dg[Dg!=0]) 
    kg <- density(Dg,bw="nrd0",n=256) 
    M[r,] <- kg$y 
    } 
}) 
Rprof(NULL) 
head(summaryRprof("profile")$by.total,10) 
#     total.time total.pct self.time self.pct 
# "system.time"   11.52 100.00  0.02  0.17 
# "spDists"    7.08  61.46  0.02  0.17 
# "matrix"    6.76  58.68  0.24  2.08 
# "apply"     6.58  57.12  0.26  2.26 
# "FUN"     5.88  51.04  0.22  1.91 
# "spDistsN1"    5.66  49.13  3.36 29.17 
# "density"    3.18  27.60  0.02  0.17 
# "density.default"  3.16  27.43  0.06  0.52 
# "bw.nrd0"    1.98  17.19  0.00  0.00 
# "quantile"    1.76  15.28  0.02  0.17

随着S变大，计算密度开始占主导地位，因为结果必须进行排序。您可以使用ix <- sample(1e4,size=4000)运行此代码以查看它。

来源

2014-01-17 20:51:41 jlhoward

M1和M2的差异是由bw =“nrd0”（Silverman选择带宽的方法）引起的，如果你修正h < - 10000，并且在bw = h之后，M1和M2将是相同的。我认为这是很好的帮助。谢谢。 –

您能否将我的代码（Rprof）分析发送给我的电子邮件（和.stepanyan（at）gmail.com）。我尝试过，但不幸的是它不能正常工作。另一种提高计算能力的方法是@camdixon试图分割处理器内核的想法。我从来没有做过这样的事情，你有没有经历过这样的事情？现在我正在研究这个，如果我能找到值得的东西，我也会和你分享。 –

请参阅我的编辑分析代码。我已经删除了个人资料 - 抱歉。 – jlhoward

您可能会发现提前定义空白矩阵DG的速度稍微快一些。

除此之外，您可能需要考虑多核应用程序，并授予足够的RAM空间。

来源

2014-01-17 06:15:08 RegressForward

你可能知道任何可以分裂内核来加速代码性能的方法，就像我的一样。 –

空间分布/模拟/密度函数

回答

相关问题