2017-09-01

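I am running code on Bluehive. The code takes a parameter N. When N is small, the code runs fine. But for a slightly larger N (e.g. N = 10), the code runs for several hours and then ends with the following error message: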

slurmstepd: error: Exceeded step memory limit at some point. 


#SBATCH -o log.%a.txt -t 3-01:01:00 
#SBATCH --mem-per-cpu=1gb 
#SBATCH -c 4 
#SBATCH --gres=gpu:1 
#SBATCH -J Ankani 
#SBATCH -a 1-2 

python run.py $SLURM_ARRAY_TASK_ID 







Memory cgroups track not only the RSS of tasks in your job but also file cache, mmap pages, etc. If I had to guess, you are hitting the memory limit due to page cache. In that case, you might be able to just ignore this error, since hitting the limit here probably just triggered memory reclaim, which freed cached pages (this shouldn't be a fatal error).
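To check whether page cache really dominates, one option is to compare the `cache` and `rss` counters that the cgroup v1 memory controller reports in its `memory.stat` file. The sketch below only parses text in that format. Where the job's `memory.stat` lives depends on how the cluster is set up, so that part is left out, and the sample numbers are made up for illustration.

```python
def parse_memory_stat(text):
    """Parse 'name value' lines (cgroup v1 memory.stat format) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def cache_fraction(stats):
    """Share of rss + cache that is page cache (0.0 if both are zero)."""
    cache = stats.get("cache", 0)
    total = stats.get("rss", 0) + cache
    return cache / total if total else 0.0


# Illustrative numbers only, not measured on any real job.
sample = "cache 3221225472\nrss 805306368\nmapped_file 1048576\n"
stats = parse_memory_stat(sample)
print(f"page cache share: {cache_fraction(stats):.0%}")
```

A high share would support the page-cache explanation; a low one would suggest the job's own memory (RSS) is what hits the limit.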

If you'd like to avoid the error, and you're only writing out data and don't want it cached, then you could try playing with posix_fadvise(2) using the POSIX_FADV_DONTNEED advice, which hints to the VM that you aren't going to read the pages you're writing out again.
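In Python, the same call is available as `os.posix_fadvise` (Unix only, Python 3.3+). The following is a minimal sketch, assuming the job writes its results as one blob of bytes. The helper name `write_without_caching` and the file name `result.bin` are hypothetical and not taken from the question's `run.py`.

```python
import os
import tempfile


def write_without_caching(path, data):
    """Write bytes to path, then hint that its cached pages are not needed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]
        # Flush to disk first: the hint is only useful for clean pages.
        os.fsync(fd)
        # offset 0 with length 0 means "from the start to the end of the file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return len(data)


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "result.bin")
    n = write_without_caching(out_path, b"x" * 1_000_000)
    size = os.path.getsize(out_path)
```

The call is advisory, so the kernel is free to ignore it. Whether it makes the Slurm message go away on a given cluster would need to be tested.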

Here is the source of this text.