2017-08-11 138 views

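Theano segfault with CUDA 9.0

I am trying to install and use Theano with CUDA 9.0 on a P100 node. The installation itself went smoothly, but importing Theano ends in a segmentation fault (see below).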

I tried Theano 0.9.0 and Theano 0.10.0beta1, each combined with libgpuarray/pygpu 0.6.8 and 0.6.9. Every combination ends in a segfault.

Here is my setup:
* RHEL 7
* GCC: 4.8.5
* CUDA: 9.0
* cuDNN: 5.1.5
* Python: 2.7.13
* cmake: 3.7.2

[[email protected] ~]$ python 
Python 2.7.13 (default, Aug 10 2017, 07:33:11) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2 
Type "help", "copyright", "credits" or "license" for more information. 
>>> import theano 
-------------------------------------------------------------------------- 
A process has executed an operation involving a call to the 
"fork()" system call to create a child process. Open MPI is currently 
operating in a condition that could result in memory corruption or 
other system errors; your job may hang, crash, or produce silent 
data corruption. The use of fork() (or system() or other calls that 
create child processes) is strongly discouraged. 

The process that invoked fork was: 

    Local host:   [[52508,1],0] (PID 3946) 

If you are *absolutely sure* that your application will successfully 
and correctly survive a call to fork(), you may disable this warning 
by setting the mpi_warn_on_fork MCA parameter to 0. 
-------------------------------------------------------------------------- 
[c460:03946] *** Process received signal *** 
[c460:03946] Signal: Segmentation fault (11) 
[c460:03946] Signal code: Invalid permissions (2) 
[c460:03946] Failing at address: 0x3fff8d48f5b0 
[c460:03946] [ 0] [0x3fff9cdf0478] 
[c460:03946] [ 1] /home/bsankara/software/ppc64le-08102017/lib/libgpuarray.so.2(load_libcuda+0x60)[0x3fff8631b5e0] 
[c460:03946] [ 2] /home/bsankara/software/ppc64le-08102017/lib/libgpuarray.so.2(+0x3f384)[0x3fff862df384] 
[c460:03946] [ 3] /home/bsankara/software/ppc64le-08102017/lib/libgpuarray.so.2(+0x41118)[0x3fff862e1118] 
[c460:03946] [ 4] /home/bsankara/software/ppc64le-08102017/lib/libgpuarray.so.2(gpucontext_init+0x90)[0x3fff862c7930] 
[c460:03946] [ 5] /home/bsankara/software/ppc64le-08102017/lib/python2.7/site-packages/pygpu-0.6.8-py2.7-linux-ppc64le.egg/pygpu/gpuarray.so(+0x2c974)[0x3fff8638c974] 
[c460:03946] [ 6] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(+0x101050)[0x3fff9cc61050] 
[c460:03946] [ 7] /home/bsankara/software/ppc64le-08102017/lib/python2.7/site-packages/pygpu-0.6.8-py2.7-linux-ppc64le.egg/pygpu/gpuarray.so(+0x54318)[0x3fff863b4318] 
[c460:03946] [ 8] /home/bsankara/software/ppc64le-08102017/lib/python2.7/site-packages/pygpu-0.6.8-py2.7-linux-ppc64le.egg/pygpu/gpuarray.so(+0x56530)[0x3fff863b6530] 
[c460:03946] [ 9] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyCFunction_Call+0x164)[0x3fff9cc31554] 
[c460:03946] [10] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8e64)[0x3fff9ccc9484] 
[c460:03946] [11] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xb40)[0x3fff9cccb360] 
[c460:03946] [12] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8f04)[0x3fff9ccc9524] 
[c460:03946] [13] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xb40)[0x3fff9cccb360] 
[c460:03946] [14] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x8f04)[0x3fff9ccc9524] 
[c460:03946] [15] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xb40)[0x3fff9cccb360] 
[c460:03946] [16] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCode+0x34)[0x3fff9cccb484] 
[c460:03946] [17] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyImport_ExecCodeModuleEx+0xe0)[0x3fff9cce8960] 
[c460:03946] [18] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(+0x188e50)[0x3fff9cce8e50] 
[c460:03946] [19] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(+0x18ad54)[0x3fff9ccead54] 
[c460:03946] [20] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(+0x18a540)[0x3fff9ccea540] 
[c460:03946] [21] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyImport_ImportModuleLevel+0x2f4)[0x3fff9cceb7b4] 
[c460:03946] [22] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(+0x15d038)[0x3fff9ccbd038] 
[c460:03946] [23] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyCFunction_Call+0x164)[0x3fff9cc31554] 
[c460:03946] [24] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyObject_Call+0x74)[0x3fff9cbc1ab4] 
[c460:03946] [25] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x68)[0x3fff9ccbfc68] 
[c460:03946] [26] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x3214)[0x3fff9ccc3834] 
[c460:03946] [27] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xb40)[0x3fff9cccb360] 
[c460:03946] [28] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyEval_EvalCode+0x34)[0x3fff9cccb484] 
[c460:03946] [29] /home/bsankara/software/ppc64le-08102017/lib/libpython2.7.so.1.0(PyImport_ExecCodeModuleEx+0xe0)[0x3fff9cce8960] 
[c460:03946] *** End of error message *** 
Segmentation fault 

Any help would be greatly appreciated. Thanks.

Answers

0

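Grab a demo MPI program in C or C++ from the web and compile it with mpicc/mpic++. Check that the compiler works, that the executable you build actually runs, and that it can handle point-to-point communication between different nodes in the cluster.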

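You may have compiled Theano with the wrong mpicc, one whose underlying compiler is not binary-compatible with the InfiniBand libraries (or the libraries for whatever hardware connects the machines in the cluster).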

For example, if the InfiniBand libraries were built with gcc and Theano was built with an mpicc that wraps the Intel compiler, it will not work.

You can set an environment variable to make Open MPI's mpicc use a different compiler.
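For Open MPI's wrapper compilers, the environment variable in question is `OMPI_CC` (with `OMPI_CXX` for the C++ wrapper). The sketch below sets it for a child process and asks the wrapper, via `--showme`, which command it would run. The helper names are hypothetical, and `gcc` as the replacement compiler is only an illustrative choice.

```python
import os
import subprocess


def env_with_compiler(cc, base_env=None):
    """Return a copy of the environment with OMPI_CC pointing at `cc`."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMPI_CC"] = cc
    return env


def show_wrapped_command(cc):
    """Ask Open MPI's mpicc which underlying command it would invoke.

    Returns None when mpicc is missing or does not accept --showme
    (for example, when it is not an Open MPI wrapper).
    """
    try:
        out = subprocess.check_output(
            ["mpicc", "--showme"], env=env_with_compiler(cc)
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.decode().strip()


if __name__ == "__main__":
    # "gcc" is an illustrative choice, not something the answer specifies.
    print(show_wrapped_command("gcc"))
```

If the printed command starts with a compiler other than the one you expected, the wrapper was picked up from a different MPI installation or the variable did not reach it.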

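If that machine has several MPI implementations built with different compilers, try using ldd to find out which shared objects (the .so files) depend on which one.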
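As a sketch of that check, the snippet below runs `ldd` on a shared object and collects the libraries it resolves to, so that an unexpected MPI or compiler runtime path stands out. The helper names are hypothetical, and the example path is the libgpuarray location that appears in the trace in the question.

```python
import subprocess


def parse_ldd(text):
    """Map each library name in ldd output to its resolved path (or None)."""
    deps = {}
    for line in text.splitlines():
        if "=>" not in line:
            continue  # lines without "=>" (e.g. the dynamic loader) are skipped
        name, _, rest = line.partition("=>")
        rest = rest.strip()
        if rest.startswith("not found"):
            deps[name.strip()] = None
        else:
            # Drop the trailing load address, e.g. "(0x00003fff8d480000)".
            deps[name.strip()] = rest.split("(")[0].strip() or None
    return deps


def shared_deps(path):
    """Run ldd on `path`; return None if ldd is unavailable or fails."""
    try:
        out = subprocess.check_output(["ldd", path])
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_ldd(out.decode())


if __name__ == "__main__":
    lib = "/home/bsankara/software/ppc64le-08102017/lib/libgpuarray.so.2"
    deps = shared_deps(lib) or {}
    for name, resolved in sorted(deps.items()):
        print("%s -> %s" % (name, resolved))
```

Entries that resolve to `None` are libraries the loader could not find, which is worth checking before looking at compiler mismatches.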

The best case, of course, is to compile everything with the same compiler and the same MPI wrapper, and to package the results as a few modules.

0

The answer turned out to be the gcc version and libgpuarray. For some reason gcc-4.8.5 has a problem with libgpuarray, and that is what caused the segmentation fault.
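One way to confirm that the crash comes from libgpuarray rather than from Theano itself is to initialize a pygpu context in a child process and inspect its exit status, so that a segfault does not take down the calling session. This is a minimal sketch: the device name `cuda0` and the helper names are assumptions, not taken from the original setup.

```python
import signal
import subprocess
import sys

# Child script: only pygpu is imported, Theano is left out on purpose.
# "cuda0" is an assumed device name.
PROBE = "import pygpu; pygpu.init('cuda0'); print('context ok')"


def describe_exit(returncode):
    """Turn a child's return code into a short diagnosis."""
    if returncode == 0:
        return "ok"
    if returncode == -signal.SIGSEGV:
        return "segfault"
    return "failed (%d)" % returncode


def probe_pygpu(python=sys.executable):
    """Run the probe in a separate interpreter and classify the outcome."""
    return describe_exit(subprocess.call([python, "-c", PROBE]))


if __name__ == "__main__":
    print(probe_pygpu())
```

A result of `segfault` with Theano out of the picture points at the libgpuarray/pygpu build, which matches frames 1 to 5 of the trace in the question.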

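I installed gcc-5.4.0 in my user space and recompiled cmake and libgpuarray, along with everything else including Theano and numpy (just to be sure). After that the segmentation fault no longer occurred.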
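Before rebuilding, it can help to confirm which gcc is actually first on `PATH`, since a user-space install is easy to shadow with the system compiler. The sketch below parses the output of `gcc -dumpversion` and compares it with 5.4.0. Treating 5.4.0 as a lower bound is an assumption: only 4.8.5 (segfault) and 5.4.0 (working) were tried here. The helper names are hypothetical.

```python
import subprocess

# Only 4.8.5 (segfault) and 5.4.0 (works) were tried; using 5.4.0 as a
# lower bound is an assumption.
KNOWN_GOOD = (5, 4, 0)


def parse_version(text):
    """Convert '5.4.0' (or a bare major such as '7') to a 3-tuple of ints."""
    parts = [int(p) for p in text.strip().split(".")]
    return tuple((parts + [0, 0])[:3])


def gcc_version(gcc="gcc"):
    """Return the version of `gcc` as a tuple, or None if undetermined."""
    try:
        out = subprocess.check_output([gcc, "-dumpversion"])
        return parse_version(out.decode())
    except (OSError, subprocess.CalledProcessError, ValueError):
        return None


if __name__ == "__main__":
    version = gcc_version()
    if version is None:
        print("could not determine the gcc version")
    elif version < KNOWN_GOOD:
        print("gcc %d.%d.%d predates the known-good 5.4.0" % version)
    else:
        print("gcc %d.%d.%d" % version)
```

Running this in the same shell used for the libgpuarray build shows whether the newer compiler is the one that cmake will pick up by default.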

One other change: the cluster administrators updated CUDA to 9.0.151 with the new driver, 384.66.