2016-08-02 38 views
5

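I get this error when using the slurm workload manager (http://slurm.schedmd.com/). When I run some TensorFlow Python scripts, they sometimes fail with the error attached below. It looks as though the installed CUDA libraries cannot be found, but the scripts I am running do not need a GPU, so I am confused about why CUDA matters at all. Why is the CUDA installation a problem if I don't need it? And why do slurm jobs freeze indefinitely when they are TensorFlow scripts?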

The only useful information I get from the SLURM-JOB_ID file is the following:

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally 
I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcudnn.so. LD_LIBRARY_PATH: /cm/shared/openmind/cuda/7.5/lib64:/cm/shared/openmind/cuda/7.5/lib 
I tensorflow/stream_executor/cuda/cuda_dnn.cc:2092] Unable to load cuDNN DSO 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally 
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: node047 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: node047 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:347] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.63 Sat Nov 7 21:25:42 PST 2015 
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) 
""" 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 352.63.0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine. 

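I always thought TensorFlow did not require a GPU, so I assume the last message, which says no GPU is available, is not what causes the failure (correct me if I'm wrong).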

I don't understand why I need the CUDA libraries. I am not trying to run my job on a GPU; if my jobs are CPU jobs, why would I need the CUDA libraries?


I tried logging in to the node directly and starting TensorFlow there, but I got no obvious error:

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally 
I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcudnn.so. LD_LIBRARY_PATH: /cm/shared/openmind/cuda/7.5/lib64:/cm/shared/openmind/cuda/7.5/lib 
I tensorflow/stream_executor/cuda/cuda_dnn.cc:2092] Unable to load cuDNN DSO 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally 

whereas I expected this error:

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally 
I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcudnn.so. LD_LIBRARY_PATH: /cm/shared/openmind/cuda/7.5/lib64:/cm/shared/openmind/cuda/7.5/lib 
I tensorflow/stream_executor/cuda/cuda_dnn.cc:2092] Unable to load cuDNN DSO 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally 
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally 
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: node047 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: node047 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:347] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.63 Sat Nov 7 21:25:42 PST 2015 
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) 
""" 
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 352.63.0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine. 

I also opened an official issue on the TensorFlow GitHub repository:

https://github.com/tensorflow/tensorflow/issues/3632

+1

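To answer "why is this happening?": TensorFlow running inside the slurm environment cannot find libcuda.so: 'libcuda reported version is: Not found: was unable to find libcuda.so'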

+0

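@RobertCrovella So the error is not caused by 'libcuda reported version is: Not found: was unable to find libcuda.so'? I always thought that if it could not find a GPU it would simply not use one, and that this was fine.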

+0

I opened an official GitHub issue to see whether someone can help me fix this: https://github.com/tensorflow/tensorflow/issues/3632

Answers

1

There is some bug that affects TensorFlow runs when they are submitted to slurm as batch jobs.

For now I work around it by running the jobs through srun on slurm.
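If many scripts have to be started this way, the srun workaround can be scripted. The following is only a minimal sketch under stated assumptions: `srun` is on the PATH, one task per script is enough, and each srun call started outside an existing allocation requests its own allocation. The helper names `build_srun_command` and `launch_all` are hypothetical, and this does not explain or fix the underlying hang.

```python
import shutil
import subprocess


def build_srun_command(script, python="python"):
    """Return the argv list that runs one script as a single srun task."""
    # --ntasks=1 asks Slurm for exactly one task for this step.
    return ["srun", "--ntasks=1", python, script]


def launch_all(scripts):
    """Start every script through srun, then wait for all of them."""
    if shutil.which("srun") is None:
        raise RuntimeError("srun not found on PATH; run this where Slurm is installed")
    # Start all processes first so they run concurrently.
    procs = [subprocess.Popen(build_srun_command(s)) for s in scripts]
    # wait() returns each exit code; 0 means the script finished normally.
    return [p.wait() for p in procs]


# Example (hypothetical script names):
# exit_codes = launch_all(["experiment_a.py", "experiment_b.py"])
```

Whether the cluster allows that many concurrent srun allocations depends on site policy, so check the limits before launching dozens of scripts at once.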

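It also looks as though, in your case, you installed the GPU build of TensorFlow and are running it on a machine without a GPU. That is what causes the other error you are seeing.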
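To compare what the dynamic loader can see inside a batch job with what it sees in an interactive session on the same node, a small check like the one below can be run in both places. It is a diagnostic sketch, not a fix: it only tries to load the libraries named in the log with `ctypes` and prints `LD_LIBRARY_PATH`. The function name `check_cuda_libraries` is hypothetical.

```python
import ctypes
import os


def check_cuda_libraries(names=("libcuda.so", "libcudnn.so", "libcublas.so")):
    """Try to load each shared library and report whether the loader found it."""
    found = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # uses the normal loader search path
            found[name] = True
        except OSError:
            # Raised when the library is missing or cannot be loaded.
            found[name] = False
    return found


print("LD_LIBRARY_PATH:", os.environ.get("LD_LIBRARY_PATH", "<unset>"))
for lib, ok in check_cuda_libraries().items():
    print(lib, "loadable" if ok else "NOT loadable")
```

If the two environments print different results, the difference in environment variables between them is a reasonable next thing to inspect.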

+0

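What do you mean by running srun? Would you mind clarifying? Unfortunately I need to run about 30 scripts at a time, so that won't work for me. I guess I'm stuck with using the GPU (it does get stuck there too, but less often).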

+0

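"srun --pty bash" will give you an interactive session. I'll post here when they fix it or when I find the reason behind it, but all I know is that there is a bug when submitting jobs with sbatch. – Steven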

+0

So when you run srun with bash, you start the jobs from there and everything runs as expected? (As an update, it sometimes gets stuck on the GPU too.)

0

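I have been having a similar problem, and I have put it down to the saver hanging when it writes the model to a Lustre filesystem. I am still waiting for a real solution, though.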
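One thing that could be tried while waiting is to write the checkpoint to node-local storage first and copy the finished files to the shared filesystem afterwards. The sketch below is untested against this hang and assumes it really comes from the saver writing directly to Lustre. `save_via_local_scratch` and the `write_checkpoint` callable are hypothetical names, not part of TensorFlow.

```python
import os
import shutil
import tempfile


def save_via_local_scratch(write_checkpoint, final_dir, scratch_root=None):
    """Write checkpoint files to a local temp dir, then copy them to final_dir.

    write_checkpoint is a caller-supplied function that takes a directory
    path and writes the checkpoint files into it.
    """
    os.makedirs(final_dir, exist_ok=True)
    # The temporary directory is deleted automatically when the block exits.
    with tempfile.TemporaryDirectory(dir=scratch_root) as local_dir:
        write_checkpoint(local_dir)
        copied = []
        for name in sorted(os.listdir(local_dir)):
            src = os.path.join(local_dir, name)
            if os.path.isfile(src):
                shutil.copy2(src, os.path.join(final_dir, name))
                copied.append(name)
    return copied
```

Here `scratch_root` would point at a node-local directory; leaving it as `None` uses the system default temporary location.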