2017-09-01 2430 views

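I am running code on Bluehive. The code takes a parameter N. If N is small, the code runs fine. But for a slightly larger N (e.g. N = 10), the code runs for several hours and at the end I get the following error message: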

slurmstepd: error: Exceeded step memory limit at some point. 

The batch file I submit contains the following:

#!/bin/bash 
#SBATCH -o log.%a.txt -t 3-01:01:00 
#SBATCH --mem-per-cpu=1gb 
#SBATCH -c 4 
#SBATCH --gres=gpu:1 
#SBATCH -J Ankani 
#SBATCH -a 1-2 

python run.py $SLURM_ARRAY_TASK_ID 

I have allocated enough memory for the code, but I still get the error:

"slurmstepd: error: Exceeded step memory limit at some point." 

Can anyone help?

Answer


Note that the memory limit described as the "step memory limit" in this error message is not necessarily related to the RSS of your process. This limit is provided and enforced by the cgroup plugin, and memory cgroups track not only the RSS of the tasks in your job but also file cache, mmap pages, etc. If I had to guess, you are hitting the memory limit due to page cache. In that case, you might be able to just ignore this error, since hitting the limit here probably just triggered memory reclaim, which freed cached pages (this shouldn't be a fatal error).
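One way to check whether page cache, rather than the process itself, is the likely cause is to log the peak RSS at the end of the run and compare it with the requested memory. The following is only a minimal sketch: the helper name `peak_rss_mib` is made up for illustration, and it measures only the calling Python process, not child processes or anything else counted by the cgroup.

```python
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in KiB; macOS reports it in bytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__":
    # Hypothetical usage: call this at the very end of run.py.
    print(f"peak RSS: {peak_rss_mib():.1f} MiB")
```

If the reported peak stays well below the job's memory request while the warning still appears, that is consistent with the page cache explanation above, though it does not prove it.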

If you'd like to avoid the error, and you're only writing out data and don't want it cached, then you could try posix_fadvise(2) with POSIX_FADV_DONTNEED, which hints to the VM that you aren't going to read the pages you're writing out again.
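Since the job runs a Python script, the same hint is available from Python as `os.posix_fadvise` (Unix only, Python 3.3+). The sketch below is an illustration under assumptions, not the asker's code: the function name `write_without_caching` and its arguments are hypothetical. It calls `os.fsync` first because the kernel generally cannot drop pages that are still dirty.

```python
import os


def write_without_caching(path, chunks):
    """Write byte chunks to path, then hint that the cached pages can be dropped.

    Returns the number of bytes written.
    """
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            total += len(chunk)
        # Push buffered data to the OS, then to disk, so the pages are clean.
        f.flush()
        os.fsync(f.fileno())
        # offset=0, length=0 applies the advice to the whole file.
        os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
    return total
```

The advice is only a hint, so the kernel may still keep some pages cached. For long-running output it may help to issue the hint periodically rather than once at the end.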
