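How do I use GNU parallel (bash script) with the aprun command on a Cray XE6 compute node (Unix-like environment)?
I want to run 16 instances of an mpi4py Python script, hello.py. I stored the 16 commands in s.txt, each of this form: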
python /lustre/4_mpi4py/hello.py > 01.out
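For reference, a command file of this shape could be generated rather than typed by hand. The following is only a sketch: the script path is copied from the line above, the 01 to 16 numbering of the output files is an assumption (the post shows only one line of s.txt), and `build_commands` and `write_command_file` are hypothetical helper names.

```python
# Sketch: build the command lines for a file like s.txt.
# SCRIPT is copied from the example line; numbering 01..16 is an assumption.
SCRIPT = "/lustre/4_mpi4py/hello.py"


def build_commands(script, count=16):
    """Return one shell command per instance, each redirecting to NN.out."""
    return ["python {} > {:02d}.out".format(script, i) for i in range(1, count + 1)]


def write_command_file(path, commands):
    """Write one command per line, the layout that `parallel :::: file` reads."""
    with open(path, "w") as handle:
        handle.write("\n".join(commands) + "\n")


# Usage (hypothetical): write_command_file("s.txt", build_commands(SCRIPT))
```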
I submit this on the Cray cluster through aprun like this:
aprun -n 32 sh -c 'parallel -j 8 :::: s.txt'
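My intent is to run 8 of those Python jobs at a time. The script ran for more than 3 hours and no *.out files were created. The PBS scheduler output file shows this: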
Python version 2.7.3 loaded
aprun: Apid 11432669: Caught signal Terminated, sending to application
aprun: Apid 11432669: Caught signal Terminated, sending to application
parallel: SIGTERM received. No new jobs will be started.
parallel: SIGTERM received. No new jobs will be started.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: SIGTERM received. No new jobs will be started.
parallel: SIGTERM received. No new jobs will be started.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now.
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 07.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 03.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 09.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 07.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 02.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 04.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 06.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 09.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 09.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 01.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 01.out
parallel: SIGTERM received. No new jobs will be started.
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 10.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 03.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 04.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 08.out
parallel: SIGTERM received. No new jobs will be started.
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 03.out
I am running this on a single node, which has 32 cores. I suspect my use of the GNU parallel command is wrong. Can someone help with this?
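One detail in the scheduler output is that the same redirect target is listed more than once (03.out and 09.out each appear three times), and the SIGTERM messages are repeated as well. That would be consistent with several copies of parallel each working through s.txt: `aprun -n 32` requests 32 processing elements, so the quoted `sh -c` command is started 32 times rather than once. The sketch below counts the repeated targets in a log excerpt; the function names are hypothetical and the sample is four lines copied from the output.

```python
import re
from collections import Counter

# Matches the redirect target at the end of a "parallel: python ... > NN.out" line.
TARGET = re.compile(r">\s*(\S+\.out)\s*$")


def count_output_targets(log_text):
    """Count how often each redirect target appears in parallel's job listing."""
    counts = Counter()
    for line in log_text.splitlines():
        if not line.startswith("parallel: python"):
            continue
        match = TARGET.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def repeated_targets(counts):
    """Return the targets listed more than once, sorted by name."""
    return sorted(name for name, seen in counts.items() if seen > 1)


# Four lines copied from the scheduler output.
SAMPLE = """\
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 07.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 03.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 09.out
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 07.out
"""

print(repeated_targets(count_output_targets(SAMPLE)))  # prints ['07.out']
```

If multiple copies are indeed the cause, requesting a single instance (`aprun -n 1`, possibly with `-d 32` to reserve the node's cores for it) would be a reasonable first thing to try. This has not been verified on this machine.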
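What is your Cray? Is it some kind of Linux, and which one (including version)? Does your script work without the GNU 'parallel' command? Why do you want to use the 'parallel' command (what is the task; MPI is usually good at starting parallel jobs)? – osgx
It is a supercomputer. The goal is to run all 16 instances of the Python script on just 1 node, but since the node has, say, 32 GB of memory, all 16 jobs cannot run at once (so I run, say, 8 at a time); the same applies when your application is not threaded. In any case, I have to use GNU parallel. I am new to its syntax, though, and I think my mistake is there. – user2458189
The compute node I am running on has a Unix-like environment. It is a Cray XE6. My Python script works; I have tested it many times. – user2458189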
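To make the intended behaviour concrete, here is a rough sketch of what `parallel -j 8 :::: s.txt` is expected to do: take each line of the file as a shell command and keep at most 8 of them running at once. It is not a replacement for GNU parallel and does not use it; it swaps in a Python thread pool purely as an illustration, assumes Python 3 (the log shows 2.7.3 on the cluster), and the helper names are hypothetical. Something like it could help check the command file on its own, separately from aprun.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def read_commands(path):
    """Return the non-empty lines of a command file such as s.txt."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def run_one(command):
    """Run one line through the shell so redirections like '> 01.out' work."""
    return subprocess.call(command, shell=True)


def run_limited(commands, max_jobs=8):
    """Run every command with at most max_jobs running at once.

    Exit codes are returned in the same order as the input commands.
    """
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        return list(pool.map(run_one, commands))


# Usage (hypothetical): exit_codes = run_limited(read_commands("s.txt"), max_jobs=8)
```

With 16 lines and `max_jobs=8`, the first 8 commands start immediately and each remaining one starts as soon as a slot frees up, which matches the "8 at a time" goal described in the comments.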