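parLapply from inside a function unexpectedly copies data to the nodes
I have a large list (~30 GB) and the following functions:
library(parallel)
cl <- makeCluster(24, outfile = "")
# Foo1 passes the top-level function Bar1 directly to parLapply
Foo1 <- function(cl, largeList) {
  return(parLapply(cl, largeList, Bar1))
}
Bar1 <- function(listElement) {
  return(nrow(listElement))
}
# Foo2 exports arg to the workers and wraps Bar2 in an anonymous function
Foo2 <- function(cl, largeList, arg) {
  clusterExport(cl, list("arg"), envir = environment())
  return(parLapply(cl, largeList, function(x) Bar2(x, arg)))
}
Bar2 <- function(listElement, arg) {
  return(nrow(listElement))
}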
The following call works without problems:
Foo1(cl, largeList)
Looking at the memory usage of each process, I can see that only one list element is copied to each node.
However, when I call:
Foo2(cl, largeList, 0)
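a full copy of largeList is sent to every node. Stepping through Foo2, the copy of largeList is not made at the clusterExport call but at the parLapply call. Also, when I execute the body of Foo2 from the global environment (rather than inside a function), there is no problem. What is causing this?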
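One possible explanation, offered as a sketch rather than a confirmed diagnosis: the anonymous function in Foo2 is created inside Foo2, so its enclosing environment is Foo2's call frame, which also holds largeList. When R serializes a closure, it serializes the enclosing environment along with it unless that environment is the global environment (or a namespace), which would also fit the observation that running the body from the global environment works. The snippet below checks this without a cluster by comparing serialized sizes. The names make_inner and serialized_size and the 8 MB vector are illustrative assumptions, not part of the original code.

```r
# Hypothetical helper: number of bytes R would write when serializing f.
serialized_size <- function(f) length(serialize(f, NULL))

# Hypothetical factory that mimics Foo2: the returned function is created
# inside a call frame that also holds a large object.
make_inner <- function(big) {
  force(big)         # evaluate the argument so the frame holds the actual data
  function(x) x + 1  # this closure's enclosing environment is the frame
}

big <- numeric(1e6)  # roughly 8 MB of doubles (illustrative size)

# Closure created inside a function: its environment contains `big`.
f_inner <- make_inner(big)
size_inner <- serialized_size(f_inner)

# Same function body, but with the enclosing environment reset to the
# global environment, which is serialized by reference only.
f_fixed <- f_inner
environment(f_fixed) <- globalenv()
size_fixed <- serialized_size(f_fixed)

cat("closure with call frame:", size_inner, "bytes\n")
cat("closure with globalenv: ", size_fixed, "bytes\n")
```

If this is indeed the cause, one option is to avoid the anonymous function and pass arg through the `...` argument of parLapply, for example parLapply(cl, largeList, Bar2, arg), so that only a top-level function is sent to the workers.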
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 21 (Twenty One)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel splines stats graphics grDevices utils
[7] datasets methods base
other attached packages:
[1] xts_0.9-7 zoo_1.7-12 snow_0.3-13
[4] Rcpp_0.12.2 randomForest_4.6-12 gbm_2.1.1
[7] lattice_0.20-33 survival_2.38-3 e1071_1.6-7
loaded via a namespace (and not attached):
[1] class_7.3-13 tools_3.2.2 grid_3.2.2
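What operating system are you on, and what is your makeCluster call? –
The OS is Fedora 21. I have edited the question to include the makeCluster call and sessionInfo – tmakino
I believe that, regardless of the OS, the default cluster type is PSOCK rather than FORK. This is the cluster setup I use in a package: 'if (grepl("Windows", sessionInfo()$running)) {cl <- makeCluster(nnodes, type = "PSOCK")} else {cl <- makeCluster(nnodes, type = "FORK")}' ... Can you confirm whether your cluster type uses forking? –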