How do I use GNU parallel (bash script) with the aprun command on a Cray XE6 compute node (Unix-like environment)?

I want to run 16 instances of an mpi4py Python script, hello.py. I stored 16 commands like this one in s.txt:

python /lustre/4_mpi4py/hello.py > 01.out

I submit this on the Cray cluster with an aprun command like this:

aprun -n 32 sh -c 'parallel -j 8 :::: s.txt'

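My intention is to run 8 of those Python jobs at a time. The script ran for more than 3 hours and none of the *.out files was created. This is what I get in the PBS scheduler output file: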

Python version 2.7.3 loaded 
aprun: Apid 11432669: Caught signal Terminated, sending to application 
aprun: Apid 11432669: Caught signal Terminated, sending to application 
parallel: SIGTERM received. No new jobs will be started. 
parallel: SIGTERM received. No new jobs will be started. 
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now. 
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now. 
parallel: SIGTERM received. No new jobs will be started. 
parallel: SIGTERM received. No new jobs will be started. 
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now. 
parallel: Waiting for these 8 jobs to finish. Send SIGTERM again to stop now. 
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 07.out 
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 03.out 
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 09.out 
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 07.out 
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 02.out 
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 04.out 
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 06.out 
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 09.out 
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 09.out 
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 01.out 
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 01.out 
parallel: SIGTERM received. No new jobs will be started. 
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 10.out 
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 03.out 
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 04.out 
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 08.out 
parallel: SIGTERM received. No new jobs will be started. 
parallel: python /lustre/beagle2/ams/testing/hpc_python_2015/examples/4_mpi4py/hello.py > 03.out

I am running this on a single node, which has 32 cores. I think my use of the GNU parallel command is wrong. Can someone help with this?

Source

2017-04-20 user2458189

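What is your Cray? Is it some kind of Linux, and which one (including version)? Does your script work without the GNU 'parallel' command? Why do you want to use 'parallel' (what is the task? MPI is usually good at starting parallel jobs itself). – osgx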

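It is a supercomputer. The goal is to run all 16 instances of the Python script on just 1 node. But if, say, the node has 32 GB of memory, you cannot run all 16 jobs at the same time (so I run, say, 8 at a time); the same applies when your application is not threaded. In any case, I have to use GNU parallel. I am new to its syntax, though, and I think my mistake is there. – user2458189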

The compute node I am running on has a Unix-like environment. It is a Cray XE6. My Python script works; I have tested it many times. – user2458189

As listed in https://portal.tacc.utexas.edu/documents/13601/1102030/4_mpi4py.pdf#page=8, hello.py looks like this:

from mpi4py import MPI

comm = MPI.COMM_WORLD

# Three equivalent ways to report this process's rank and the communicator size.
print("Hello! I'm rank %02d from %02d" % (comm.rank, comm.size))

print("Hello! I'm rank %02d from %02d" % (comm.Get_rank(), comm.Get_size()))

print("Hello! I'm rank %02d from %02d" %
      (MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.Get_size()))

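Your 4_mpi4py/hello.py program is not a typical single-process program (or single Python script); it is a multi-process MPI application.

GNU parallel expects simpler programs and does not support interacting with MPI processes.

Your cluster has many nodes, and each node may start a different number of MPI processes. With 2 CPUs of 8 cores each per node, consider the variants: 2 MPI processes of 8 OpenMP threads each; 1 MPI process with 16 threads; 16 MPI processes without threads. To describe the slice of the cluster assigned to your task, there is an interface between the cluster management software and the MPI library used by the Python MPI wrapper that your script uses. The management side is aprun (and qsub?):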

http://www.nersc.gov/users/computational-systems/retired-systems/hopper/running-jobs/aprun/aprun-man-page/

https://www.nersc.gov/users/computational-systems/retired-systems/hopper/running-jobs/aprun/

You must use the aprun command to launch jobs on the Hopper compute nodes. Use it for serial, MPI, OpenMP, UPC, and hybrid MPI/OpenMP or hybrid MPI/CAF jobs.

https://wickie.hlrs.de/platforms/index.php/CRAY_XE6_Using_the_Batch_System

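The job launcher for XE6 parallel jobs (both MPI and OpenMP) is aprun. ... The aprun example above will start the parallel executable "my_mpi_executable" with the arguments "arg1" and "arg2". The job will be started using 64 MPI processes, with 32 processes placed on each of your allocated nodes (remember that a node in the XE6 system consists of 32 cores). You need to have nodes allocated by the batch system beforehand (qsub).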

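There is an interface between aprun, qsub, and MPI. In a normal start (aprun -n 32 python /lustre/4_mpi4py/hello.py), aprun simply starts several (32) processes of your MPI program, sets the id of each process in that interface, and gives them a group id (for example, through environment variables like PMI_ID; the actual variables are specific to the launcher/MPI library combination).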
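To see which variables a launcher actually passes to each process, one can print the matching part of the environment from inside the launched script. This is a minimal sketch, not a definitive list: the prefixes `PMI_`, `ALPS_`, and `OMPI_` are assumptions about common launchers, and the helper name `launcher_variables` is hypothetical. Only `PMI_ID` is named in the answer itself.

```python
import os

# Assumed prefixes; the real variable names depend on the
# launcher / MPI library combination in use.
LAUNCHER_PREFIXES = ("PMI_", "ALPS_", "OMPI_")


def launcher_variables(environ, prefixes=LAUNCHER_PREFIXES):
    """Return the environment entries whose names start with a launcher prefix."""
    return {name: value for name, value in environ.items()
            if name.startswith(prefixes)}


if __name__ == "__main__":
    # Print whatever the launcher set for this particular process.
    for name, value in sorted(launcher_variables(os.environ).items()):
        print("%s=%s" % (name, value))
```

Running this once under the launcher and once from a plain shell shows the difference: the plain shell normally prints nothing.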

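GNU parallel has no interface to MPI programs and knows nothing about such variables. It will simply start 8 times more processes than expected. All 32 * 8 processes in your incorrect command will have the same group id, and there will be 8 processes with each MPI process id. They will make your MPI library misbehave.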
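The arithmetic can be checked with a toy model. This sketch assumes that every job started by parallel inherits the rank id of the aprun-started process it runs under; it simulates only the counting, not MPI, and the function name `simulate_rank_ids` is hypothetical.

```python
from collections import Counter


def simulate_rank_ids(launcher_processes=32, jobs_per_process=8):
    """Count how many running jobs end up with each rank id when every
    launcher process starts `jobs_per_process` children that inherit its id."""
    inherited = [rank
                 for rank in range(launcher_processes)
                 for _ in range(jobs_per_process)]
    return Counter(inherited)


if __name__ == "__main__":
    counts = simulate_rank_ids()
    print("total processes:", sum(counts.values()))
    print("processes sharing rank 0:", counts[0])
```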

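Never mix MPI resource managers/launchers with ancient pre-MPI Unix process forkers such as xargs, parallel, or "very advanced bash scripting for parallelism". MPI exists for doing things in parallel, and the MPI launchers/job managers (aprun, mpirun, mpiexec) exist for starting several processes, forking, or ssh-ing to machines.

Don't do aprun -n 32 sh -c 'parallel anything_with_MPI' - this is an unsupported combination. The only possible (allowed) argument to aprun is a program that uses one of the supported kinds of parallelism, such as OpenMP, MPI, or MPI + OpenMP, or a non-parallel program (or a single script that starts ONE parallel program).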

If you have several independent MPI tasks to start, pass several colon-separated program groups to a single aprun:

aprun -n 8 ./program_to_process_file1 : -n 8 ./program_to_process_file2 : -n 8 ./program_to_process_file3 : -n 8 ./program_to_process_file4
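When the list of programs is long, such a command line can be assembled rather than typed. A minimal sketch, assuming only the `-n` option and the `:` separator used in the answer; the helper name `build_aprun_mpmd` is hypothetical, and the code only builds the argument list without launching anything.

```python
def build_aprun_mpmd(segments):
    """Build an aprun argument list from (process_count, program) pairs,
    separating consecutive segments with ':'."""
    argv = ["aprun"]
    for index, (count, program) in enumerate(segments):
        if index > 0:
            argv.append(":")
        argv += ["-n", str(count), program]
    return argv


if __name__ == "__main__":
    programs = ["./program_to_process_file%d" % i for i in range(1, 5)]
    print(" ".join(build_aprun_mpmd([(8, p) for p in programs])))
```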

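If you have multiple files to work on, start many parallel jobs: do not use a single qsub, but several, and let PBS (or whichever job manager is used) manage your jobs.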

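If the number of files is very large, try not to use MPI in your program at all (never link the MPI libraries or include the MPI headers) and use parallel or another form of ancient parallelism, hidden from aprun. Alternatively, use a single MPI program and do the file distribution directly in your code: the MPI master process can open the file list and then distribute the files among the other MPI processes, with or without the dynamic process management of MPI/mpi4py (http://pythonhosted.org/mpi4py/usrman/tutorial.html#dynamic-process-management).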
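The second option, one MPI program that distributes the files itself, can be sketched with mpi4py. This is a minimal sketch under stated assumptions: it uses a static round-robin split rather than dynamic process management, `process` is a hypothetical placeholder for the real per-file work, and the file list is a hypothetical text file passed as the first command-line argument. `MPI.COMM_WORLD`, `comm.rank`, `comm.size`, and `comm.bcast` are mpi4py names; everything else is illustrative.

```python
import sys


def files_for_rank(files, rank, size):
    """Round-robin share of `files` for one MPI rank."""
    return files[rank::size]


def process(path):
    # Hypothetical placeholder for the real per-file work.
    return "processed %s" % path


def main(list_path):
    # Imported here so the helpers above stay usable without MPI.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    files = None
    if comm.rank == 0:
        # Only the master process reads the file list.
        with open(list_path) as handle:
            files = [line.strip() for line in handle if line.strip()]
    # Every rank receives the same list, then keeps only its own share.
    files = comm.bcast(files, root=0)
    for path in files_for_rank(files, comm.rank, comm.size):
        print("rank %02d: %s" % (comm.rank, process(path)))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

It would be started once by the launcher, for example as aprun -n 8 python distribute.py files.txt, where distribute.py and files.txt are hypothetical names.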

Some scientists try to combine MPI and parallel in the opposite order, parallel ... aprun ... or parallel ... mpirun ...:

https://rcc.uchicago.edu/docs/tutorials/kicp-tutorials/running-jobs.html#gnu-parallel
http://www.hpc.lsu.edu/training/weekly-materials/2017-Spring/gnuparallel-Feb2017.pdf#page=41
There is also a version of parallel for your Cray: https://github.com/levinas/cray-parallel

Source

2017-04-20 04:53:12 osgx

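Thank you very much for the detailed answer! Could you tell me, if I have multiple NON_MPI Python commands inside s.txt, would this be the proper syntax with aprun? aprun -n 32 sh -c 'parallel -j 8 :::: s.txt' – user2458189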

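I can't (I have never used aprun/Cray). But with non-MPI Python scripts (do not even import mpi4py, and check that the updated script has reached the compute nodes) there will be no obvious conflict between "aprun" and "parallel". You have access and can try it; I think it may work. – osgx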

However, I don't know how aprun works; used incorrectly, such a command may start "parallel" several times (32?). – osgx
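For non-MPI commands, the throttling asked of parallel -j 8 :::: s.txt (run every line of s.txt, at most 8 at a time) can also be written in plain Python. A minimal sketch, assuming Python 3 and commands that do not use MPI; it is a stand-in for GNU parallel, not a description of how parallel works internally, and the function names are hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def read_commands(path):
    """Read one shell command per non-empty line."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def run_commands(commands, max_jobs=8):
    """Run shell command lines with at most `max_jobs` running at once.
    Returns the exit codes in input order."""
    def run_one(command):
        # shell=True so that redirections such as "> 01.out" keep working.
        return subprocess.call(command, shell=True)

    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        return list(pool.map(run_one, commands))


if __name__ == "__main__" and len(sys.argv) > 1:
    codes = run_commands(read_commands(sys.argv[1]), max_jobs=8)
    sys.exit(1 if any(codes) else 0)
```

If it is started through a launcher, it should be started exactly once (one process), otherwise every copy would run the whole command list again.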
