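One way to do this is to write a wrapper script that runs a series of tasks, and then launch each copy of the wrapper with a separate aprun.
From your snippet, it looks like you want to run 2 instances of the script on each compute node, for 8 in total. So in your job script you could do something like this: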
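# Start 4 wrappers in the background; each one handles 2 indices
for ((i=0; i<8; i+=2)); do
  aprun -n 1 ./wrapper.sh "$i" 2 &
done
# Wait for all aprun instances to finish
wait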
Then in the wrapper you could do something like this (where the appended $j gives you a unique index):
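#!/bin/bash
# wrapper.sh: $1 = first index, $2 = number of instances to start
end=$(($1 + $2))
for ((j=$1; j<end; j+=1)); do
  # Each instance receives its own unique index
  ./examplebashscript.sh "$j" &
done
# Block until every background instance has finished
wait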
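To check the index arithmetic without touching the batch system, here is a small dry-run sketch. It only prints which indices each wrapper invocation would cover and does not call aprun. The helper name list_indices is made up for illustration, and the code is written in portable sh so it runs under bash as well.

```shell
# list_indices START COUNT
# Hypothetical helper: print the indices that one wrapper invocation
# would pass to examplebashscript.sh, separated by spaces.
list_indices() {
    j=$1
    end=$(($1 + $2))
    out=""
    while [ "$j" -lt "$end" ]; do
        # Append j, adding a separating space only after the first entry
        out="${out:+$out }$j"
        j=$((j + 1))
    done
    echo "$out"
}

# Mirror the job-script loop: start values 0, 2, 4, 6 with 2 instances each
i=0
while [ "$i" -lt 8 ]; do
    echo "wrapper.sh $i 2 -> indices $(list_indices "$i" 2)"
    i=$((i + 2))
done
```

The four wrappers together cover indices 0 through 7 with no overlap, which is what makes $j usable as a unique ID.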
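You can also set the following environment variables to display where the processes and threads are placed. You need to set these in your shell (or job script) before running aprun: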
export MPICH_CPUMASK_DISPLAY=1
export MPICH_RANK_REORDER_DISPLAY=1
For example, running:
aprun -n 24 ./examplebashscript.sh
(which is shorthand for):
aprun -n 24 -N 24 -S 12 -d 1 ./examplebashscript.sh
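will give the following kind of output on STDERR (note this is on an XC30 with two 12-core Intel Ivy Bridge processors per compute node, so with hyperthreading the mask shows 48 logical cores per node):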
[PE_0]: MPI rank order: Using default aprun rank ordering.
[PE_0]: rank 0 is on nid02749
[PE_0]: rank 1 is on nid02749
[PE_0]: rank 2 is on nid02749
[PE_0]: rank 3 is on nid02749
[PE_0]: rank 4 is on nid02749
[PE_0]: rank 5 is on nid02749
[PE_0]: rank 6 is on nid02749
[PE_0]: rank 7 is on nid02749
[PE_0]: rank 8 is on nid02749
[PE_0]: rank 9 is on nid02749
[PE_0]: rank 10 is on nid02749
[PE_0]: rank 11 is on nid02749
[PE_0]: rank 12 is on nid02749
[PE_0]: rank 13 is on nid02749
[PE_0]: rank 14 is on nid02749
[PE_0]: rank 15 is on nid02749
[PE_0]: rank 16 is on nid02749
[PE_0]: rank 17 is on nid02749
[PE_0]: rank 18 is on nid02749
[PE_0]: rank 19 is on nid02749
[PE_0]: rank 20 is on nid02749
[PE_0]: rank 21 is on nid02749
[PE_0]: rank 22 is on nid02749
[PE_0]: rank 23 is on nid02749
[PE_23]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000100000000000000000000000
[PE_22]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000010000000000000000000000
[PE_21]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000001000000000000000000000
[PE_0]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000000001
[PE_20]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000100000000000000000000
[PE_9]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000001000000000
[PE_11]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000100000000000
[PE_10]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000010000000000
[PE_8]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000100000000
[PE_1]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000000010
[PE_2]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000000100
[PE_18]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000001000000000000000000
[PE_7]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000010000000
[PE_15]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000001000000000000000
[PE_3]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000001000
[PE_6]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000001000000
[PE_16]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000010000000000000000
[PE_14]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000100000000000000
[PE_13]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000010000000000000
[PE_12]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000001000000000000
[PE_4]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000010000
[PE_5]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000100000
[PE_17]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000100000000000000000
[PE_19]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000010000000000000000000
You may be able to capture this output in some way and use it.
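As a rough sketch of what capturing it could look like, the functions below turn the cpumask lines shown above into "rank cpu" pairs. The names cpu_from_mask and parse_cpumask_log are made up for illustration, and the parsing assumes the exact line format in the output above with a single set bit per mask (as with -d 1); if the mask has several bits set, only the lowest one is reported.

```shell
# cpu_from_mask MASK
# Hypothetical helper: print the zero-based index of the lowest set bit,
# counting from the right-hand end of the mask string.
cpu_from_mask() {
    case $1 in
        *1*) ;;
        *) return 1 ;;          # no bit set: nothing to report
    esac
    rest=${1##*1}               # keep only the zeros after the last '1'
    echo "${#rest}"             # their count is the CPU index
}

# parse_cpumask_log
# Read aprun's STDERR on stdin and print one "rank cpu" pair per cpumask line.
# Lines that do not match the assumed format are ignored.
parse_cpumask_log() {
    sed -n 's/^\[PE_\([0-9]*\)]: cpumask set to .* cpumask = \([01]*\)$/\1 \2/p' |
    while read -r rank mask; do
        echo "$rank $(cpu_from_mask "$mask")"
    done
}

# Demo on three lines copied from the output above
parse_cpumask_log <<'EOF'
[PE_0]: rank 23 is on nid02749
[PE_23]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000100000000000000000000000
[PE_0]: cpumask set to 1 cpu on nid02749, cpumask = 000000000000000000000000000000000000000000000001
EOF
```

In a real job you would redirect STDERR to a file first, for example with 2> placement.log appended to the aprun command (the file name is just a placeholder), and then feed that file to parse_cpumask_log.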
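I'm not at all familiar with 'aprun', and you're right, from the look of it the documentation isn't very good. One thing I would try, though, is dumping the environment variables to a file with 'env' and checking whether this information is passed in through the environment. You could use something like 'env > $(hostname)-$$.env' to write a file named after the hostname and PID of the running process, which will hopefully give a different result for each invocation. – 2015-03-13 18:48:04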
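I just tried that, and unfortunately I don't see anything close to what I need. There are some SLURM variables (such as SLURM_NNODES and SLURM_JOBID), but they are the same across all jobs. So I still need someone to shed some light on how to run uniquely identified jobs with aprun. – user4668442 2015-03-13 19:14:19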