2017-02-12 100 views

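Why does TensorFlow run out of memory when restoring a checkpoint, when the original script does not?

I am running some TensorFlow code that restores a checkpoint and resumes training from it. Whenever I restore with a CPU build, it works fine. However, when I restore while running my code on a GPU, it fails. Specifically, I get this error: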

Traceback (most recent call last): 
    File "/home_simulation_research/hbf_tensorflow_code/tf_experiments_scripts/batch_main.py", line 482, in <module> 
    large_main_hp.main_large_hp_ckpt(arg) 
    File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 212, in main_large_hp_ckpt 
    run_hyperparam_search(arg) 
    File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 231, in run_hyperparam_search 
    main_hp.main_hp(arg) 
    File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_hp.py", line 258, in main_hp 
    with tf.Session(graph=graph) as sess: 
    File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1186, in __init__ 
    super(Session, self).__init__(target, graph, config=config) 
    File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 551, in __init__ 
    self._session = tf_session.TF_NewDeprecatedSession(opts, status) 
    File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__ 
    next(self.gen) 
    File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status 
    pywrap_tensorflow.TF_GetCode(status)) 
tensorflow.python.framework.errors_impl.InternalError: Failed to create session. 
E tensorflow/core/common_runtime/direct_session.cc:135] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 18446744073709551615 

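I can see that it says I ran out of memory, but increasing the memory to, say, 10 GB did not change anything. This only happens with my GPU build; the CPU build restores perfectly.

Does anyone have thoughts on what might be causing this, or ideas for where to start?

The GPUs are essentially allocated automatically, so I am not sure what the cause could be, or even what the first debugging steps would be.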


Full error:

E tensorflow/core/common_runtime/direct_session.cc:135] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 18446744073709551615 
Traceback (most recent call last): 
    File "/home_simulation_research/hbf_tensorflow_code/tf_experiments_scripts/batch_main.py", line 482, in <module> 
    large_main_hp.main_large_hp_ckpt(arg) 
    File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 212, in main_large_hp_ckpt 
    run_hyperparam_search(arg) 
    File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 231, in run_hyperparam_search 
    main_hp.main_hp(arg) 
    File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_hp.py", line 258, in main_hp 
    with tf.Session(graph=graph) as sess: 
    File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1186, in __init__ 
    super(Session, self).__init__(target, graph, config=config) 
    File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 551, in __init__ 
    self._session = tf_session.TF_NewDeprecatedSession(opts, status) 
    File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__ 
    next(self.gen) 
    File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status 
    pywrap_tensorflow.TF_GetCode(status)) 
tensorflow.python.framework.errors_impl.InternalError: Failed to create session. 

Answers

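On a CPU build, TensorFlow can draw on both physical and virtual memory, so your model has almost unlimited memory to work with. A GPU only has its fixed device memory. The first debugging step is to build a smaller model by simply removing some weights/layers and running it on the GPU, to make sure you have no coding errors. Then slowly add layers/weights back until you run out of memory again. This will confirm that you have a memory problem on the GPU. I recommend building your graph on the GPU from the start, so that you know it will fit later when you train. If you need the large graph, consider assigning parts of the graph to different GPUs, if you have them.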
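To make the "shrink the model, then grow it back" step more concrete, here is a rough sketch that estimates how much memory the parameters alone would take as the model grows. It assumes a plain fully connected network with float32 values, and the layer widths are hypothetical (none of them come from the question). It counts only weights and biases, so activations, gradients, optimizer state, and the CUDA context itself would come on top of this.

```python
def dense_param_bytes(layer_sizes, bytes_per_value=4):
    """Bytes needed to store the weights and biases of a fully connected net.

    layer_sizes: widths of consecutive layers, e.g. [784, 256, 10].
    bytes_per_value: 4 for float32 (an assumption; use 8 for float64).
    """
    total_values = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # One weight matrix (fan_in x fan_out) plus one bias vector (fan_out).
        total_values += fan_in * fan_out + fan_out
    return total_values * bytes_per_value


# Hypothetical layer widths, growing step by step as the answer suggests.
candidates = [
    [784, 256, 10],
    [784, 1024, 1024, 10],
    [784, 4096, 4096, 4096, 10],
]
for sizes in candidates:
    mib = dense_param_bytes(sizes) / 2.0 ** 20
    print("%s -> about %.1f MiB of parameters" % (sizes, mib))
```

If even the smallest variant fails at session creation, as in the traceback above, the parameter count is unlikely to be the deciding factor, since no variables have been allocated at that point.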

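I don't know whether this matters, but I have a for loop in which I build different graphs. So if I test, say, 3 models, I train the first one, then the second, then the last. Could that be the cause of the error?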
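The loop described in this comment could look roughly like the sketch below: each model gets its own graph, and each session is closed before the next model is built. This is only an illustration under assumptions. `train_models_sequentially`, `build_fns`, and `train_fn` are hypothetical names, and the TensorFlow module is passed in as `tf` so the same code can take `tensorflow` on 1.x or `tensorflow.compat.v1` on 2.x. Whether the loop actually causes the error is not established here, and closing a session is not guaranteed to hand GPU memory back to the driver.

```python
def train_models_sequentially(tf, build_fns, train_fn, config=None):
    """Train several models one after another, each in a fresh graph.

    tf:        the TensorFlow 1.x-style module (assumption: pass `tensorflow`
               on 1.x, or `tensorflow.compat.v1` on 2.x).
    build_fns: hypothetical callables; each builds one model in the current
               default graph and returns a handle to it.
    train_fn:  hypothetical callable (session, model) -> result.
    config:    optional session config, e.g. a tf.ConfigProto.
    """
    results = []
    for build_fn in build_fns:
        graph = tf.Graph()
        with graph.as_default():
            # Ops created here belong to this model's graph only.
            model = build_fn()
        # The session is closed when the with-block exits, before the
        # next model is built.
        with tf.Session(graph=graph, config=config) as sess:
            results.append(train_fn(sess, model))
    return results
```

A call would look like `train_models_sequentially(tf, [build_a, build_b, build_c], train)`, where the three builders and `train` are your own functions.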

Quite possibly. "By default, TensorFlow maps nearly all of the GPU memory", so you need to make sure you configure your session correctly. https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth

per_process_gpu_memory_fraction is probably what you want:

    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.4
    session = tf.Session(config=config, ...)
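The two session options mentioned in these comments (the memory fraction and the memory growth setting from the linked page) can be combined in one small helper. This is a sketch, not the asker's code: `make_session_config` is a hypothetical name, the default of 0.4 is simply the value from the comment above, and the TensorFlow module is passed in as `tf` so that either `tensorflow` (1.x) or `tensorflow.compat.v1` (2.x) can be used.

```python
def make_session_config(tf, memory_fraction=0.4, allow_growth=True):
    """Build a session config that limits how much GPU memory is claimed.

    tf: the TensorFlow 1.x-style module (assumption: `tensorflow` on 1.x,
        `tensorflow.compat.v1` on 2.x).
    memory_fraction: upper bound on the share of each visible GPU's memory
        that this process may use.
    allow_growth: if True, start with a small allocation and grow as needed
        instead of mapping nearly all GPU memory up front.
    """
    if not 0.0 < memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be in (0, 1]")
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    config.gpu_options.allow_growth = allow_growth
    return config


# Usage, mirroring the call in the traceback (requires TensorFlow):
#   import tensorflow as tf
#   with tf.Session(graph=graph, config=make_session_config(tf)) as sess:
#       ...
```

Note that these options only limit what this process requests. If another process already holds most of the GPU's memory, creating the session may still fail.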
