doParallel: a nested loop within a loop works, but logically that makes no sense?

I have a large corpus that I'm transforming with tm::tm_map(). Since I'm using hosted R Studio, I have 15 cores and want to use parallel processing to speed things up.

Without sharing a very large corpus, I simply can't reproduce the problem with dummy data.

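My code is below. The short description of the problem is that looping over pieces manually in the console works, but doing the same thing inside my function does not.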

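The function "clean_corpus" takes a corpus as input, breaks it into pieces and saves them to temp files to help with memory issues. The function then iterates over each piece using a %dopar% block. The function worked when tested on a small subset of the corpus, e.g. 10k documents. But on larger corpora the function returned NULL. To debug, I set the function to return the individual pieces that had been looped over rather than the rebuilt corpus as a whole. I found that on smaller corpus samples the code returns a list of all the mini corpora as expected, but when I test on larger samples of the corpus the function returns some NULLs.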

Here's why this is baffling to me:

cleaned.corpus <- clean_corpus(corpus.regular[1:10000], n = 1000) # works 
cleaned.corpus <- clean_corpus(corpus.regular[10001:20000], n = 1000) # also works 
cleaned.corpus <- clean_corpus(corpus.regular[1:50000], n = 1000) # NULL

If I do this in 10k blocks up to e.g. 50k via 5 iterations, everything works. If I run the function on e.g. the full 50k documents, it returns NULL.

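So maybe I just need to loop over smaller pieces by breaking my corpus up more. I tried this. In the clean_corpus function below, the parameter n is the length of each piece. The function still returns NULL.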

So, if I iterate like this:

# iterate over 10k docs in 10 chunks of one thousand at a time 
cleaned.corpus <- clean_corpus(corpus.regular[1:10000], n = 1000)

If I do that 5 times manually, up to 50K, everything works. The equivalent of doing that in one call to my function is:

# iterate over 50K docs in 50 chunks of one thousand at a time 
cleaned.corpus <- clean_corpus(corpus.regular[1:50000], n = 1000)

This returns NULL.

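This SO post and the one linked in its only answer suggest it might have to do with my hosted instance of RStudio on Linux, where the Linux "out of memory killer" (OOM killer) might be stopping workers. This is why I tried breaking my corpus into pieces, to get around memory issues.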

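Any theories or suggestions as to why iterating over 10k documents in 10 chunks of 1k works, whereas iterating over 50k documents in 50 chunks of 1k does not?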

Here's the clean_corpus function:

library(tm)          # tm_map, content_transformer, removeNumbers, stopwords
library(qdap)        # replace_abbreviation, rm_stopwords
library(foreach)     # foreach, %dopar%
library(doParallel)  # registerDoParallel

clean_corpus <- function(corpus, n = 500000) { # n is the length of each piece in parallel processing

    # split the corpus into pieces for looping to get around memory issues with transformation
    nr <- length(corpus)
    pieces <- split(corpus, rep(1:ceiling(nr/n), each=n, length.out=nr))
    lenp <- length(pieces)

    rm(corpus) # save memory

    # save pieces to rds files since not enough RAM
    tmpfile <- tempfile()
    for (i in seq_len(lenp)) {
        saveRDS(pieces[[i]],
                paste0(tmpfile, i, ".rds"))
    }

    rm(pieces) # save memory

    # doparallel
    registerDoParallel(cores = 14) # I've experimented with 2:14 cores
    pieces <- foreach(i = seq_len(lenp)) %dopar% {
        piece <- readRDS(paste0(tmpfile, i, ".rds"))
        # transformations; the value of the last assignment is what each worker returns
        piece <- tm_map(piece, content_transformer(replace_abbreviation))
        piece <- tm_map(piece, content_transformer(removeNumbers))
        piece <- tm_map(piece, content_transformer(function(x, ...)
            qdap::rm_stopwords(x, stopwords = tm::stopwords("en"), separate = FALSE, strip = TRUE, char.keep = c("-", ":", "/"))))
    }

    # combine the pieces back into one corpus
    corpus <- do.call(function(...) c(..., recursive = TRUE), pieces)
    return(corpus)

} # end clean_corpus function
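To make the chunking step concrete, here is a small sketch of what the split(..., rep(...)) expression in clean_corpus does. It uses a plain integer vector as a stand-in for the corpus; the vector docs and the sizes are made up for illustration and are not from my real data.

```r
# Hypothetical stand-in for a corpus: 10 "documents"
docs <- 1:10
n    <- 4                 # chunk length, same role as n in clean_corpus()
nr   <- length(docs)

# Same grouping expression as in clean_corpus():
# chunk ids are repeated n times each and truncated to nr entries
groups <- rep(1:ceiling(nr / n), each = n, length.out = nr)
pieces <- split(docs, groups)

print(groups)             # chunk id assigned to each document
print(lengths(pieces))    # size of each chunk; the last one may be shorter
```

By the same arithmetic, nr = 50000 with n = 1000 gives ceiling(50000 / 1000) = 50 pieces, which is the failing case above.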

And here are the code blocks from above again, just for readability now that the function has been shown:

# iterate over 10k docs in 10 chunks of one thousand at a time 
cleaned.corpus <- clean_corpus(corpus.regular[1:10000], n = 1000) # works 

# iterate over 50K docs in 50 chunks of one thousand at a time 
cleaned.corpus <- clean_corpus(corpus.regular[1:50000], n = 1000) # does not work

But iterating in the console by calling the function on each of

corpus.regular[1:10000], corpus.regular[10001:20000], corpus.regular[20001:30000], corpus.regular[30001:40000], corpus.regular[40001:50000] # does work on each run

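Note that I tried using the tm library's own functionality for parallel processing (see here), but I kept hitting "cannot allocate memory" errors, which is why I tried to do it "on my own" using doParallel's %dopar%.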

Source

2017-08-23 Doug Fir

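Hi, thanks for commenting. I understand it's a memory issue, but that's exactly why I went the loop route. Doesn't a loop help alleviate this by computing in chunks rather than as a whole?

Also, I did watch the script running with 1+ cores via shell > top > 1. In each case free memory appears to be lost.

Ah, I never considered that. The thing is, I am able to load the entire structure into R. The 50k sample is tiny compared to the full 10M document corpus, so even the chunks should not cause a memory issue. I wonder if I should try saving all the pieces to temp files, like I did near the top of the function.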

Summary of the solution from the comments

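Your memory issue is likely related to corpus <- do.call(function(...) c(..., recursive = TRUE), pieces), because this still holds all of your (output) data in memory.

I recommend exporting the output from each worker to a file, such as an RDS or csv file, rather than collecting it into a single data structure at the end.

Another issue (as you pointed out) is that foreach saves the output of each worker through an implied return statement (the code block in {} after %dopar% is treated as a function). I recommend adding an explicit return(1) before the closing } so that the intended output is not kept in memory (you have already explicitly saved it to a file).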
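A minimal sketch of both suggestions, under some assumptions: process_piece is a hypothetical stand-in for the tm_map transformations, the toy list replaces the real corpus pieces, and the file prefixes are made up for the example. Each worker writes its result to its own RDS file and hands back only a small placeholder value, so nothing large is collected on the master. A plain 1 as the last expression is used here in place of return(1); its value is what foreach collects.

```r
library(foreach)
library(doParallel)

# Hypothetical stand-in for the per-piece transformation
process_piece <- function(x) toupper(x)

# Toy input: three small "pieces" saved to disk, as clean_corpus() does
in_prefix  <- tempfile("piece_in_")
out_prefix <- tempfile("piece_out_")
toy <- list(c("a", "b"), c("c", "d"), "e")
for (i in seq_along(toy)) saveRDS(toy[[i]], paste0(in_prefix, i, ".rds"))

registerDoParallel(cores = 2)
status <- foreach(i = seq_along(toy)) %dopar% {
  piece <- readRDS(paste0(in_prefix, i, ".rds"))
  # write the transformed piece to its own file instead of returning it
  saveRDS(process_piece(piece), paste0(out_prefix, i, ".rds"))
  1  # small placeholder; this is all that is sent back to the master
}
stopImplicitCluster()

# Results stay on disk; read them back one at a time only when needed
first <- readRDS(paste0(out_prefix, 1, ".rds"))
```

Whether this avoids the NULL results on the real corpus depends on where the memory actually runs out, so treat it as something to try rather than a guaranteed fix.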

Source

2017-08-23 11:26:22 CPak
