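Why does TensorFlow run out of memory when restoring a checkpoint, but the original script does not?

I am running some TensorFlow code that restores from a checkpoint and resumes training. Whenever I restore with my CPU build, it seems to work fine. However, when I try to restore while running my code on a GPU, it fails. In particular, I get this error: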
Traceback (most recent call last):
  File "/home_simulation_research/hbf_tensorflow_code/tf_experiments_scripts/batch_main.py", line 482, in <module>
    large_main_hp.main_large_hp_ckpt(arg)
  File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 212, in main_large_hp_ckpt
    run_hyperparam_search(arg)
  File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_large_hp_checkpointer.py", line 231, in run_hyperparam_search
    main_hp.main_hp(arg)
  File "/usr/local/lib/python3.4/dist-packages/my_tf_pkg/main_hp.py", line 258, in main_hp
    with tf.Session(graph=graph) as sess:
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1186, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 551, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
E tensorflow/core/common_runtime/direct_session.cc:135] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 18446744073709551615
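I see that it says I am out of memory, but increasing the memory to, say, 10 GB did not change anything. This only happens with my GPU build; the CPU build restores perfectly.
Any ideas on what might be causing this, or where to start looking?
The GPUs are essentially allocated automatically, so I am not sure what could be causing this, or even what the first debugging steps would be.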
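A side note on the number in the log line: the "total memory reported" value is not a plausible memory size. The small check below shows it is exactly the largest unsigned 64-bit integer, which is what -1 looks like in an unsigned field. My reading (an assumption, not something the log states) is that the memory query itself failed while the CUDA context was being created, so the figure says nothing about how much memory the GPU actually has.

```python
# Value copied from the "total memory reported" part of the log line.
reported = 18446744073709551615

# It is exactly the maximum value of an unsigned 64-bit integer ...
is_uint64_max = reported == 2**64 - 1

# ... which is how -1 appears when stored in an unsigned 64-bit field.
as_signed = reported - 2**64

# Expressed in whole GiB, the figure is far beyond any real GPU.
reported_gib = reported // 2**30

print(is_uint64_max, as_signed, hex(reported), reported_gib)
```

Note also that the failing call is `cuDevicePrimaryCtxRetain`, which runs when the session is created, before any checkpoint is read.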
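I don't know if this matters, but I have a for loop in which I build different graphs. So if I test, say, 3 models, I train the first one, then the second, then the last. Could that be the cause of the error?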
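If the loop over models does turn out to be involved, one way to rule it out is to train each model in its own Python process instead of inside a single for loop. When a process exits, the CUDA driver reclaims all GPU memory it held, so the next model starts with a clean device. The sketch below is only an illustration: the function name and the commands are hypothetical stand-ins, and the real command would launch whatever script trains one model.

```python
import subprocess
import sys


def run_models_in_separate_processes(commands):
    """Run each training command in its own process, one after another.

    Returns the exit codes of the commands that were run. Stops at the
    first failure so that an error is not masked by later runs.
    """
    return_codes = []
    for cmd in commands:
        completed = subprocess.run(cmd)  # blocks until this model finishes
        return_codes.append(completed.returncode)
        if completed.returncode != 0:
            break
    return return_codes


# Hypothetical stand-ins for the real per-model training commands.
commands = [
    [sys.executable, "-c", "print('training model %d')" % i] for i in range(3)
]
return_codes = run_models_in_separate_processes(commands)
print(return_codes)
```

The trade-off is that every model pays the process start-up and TensorFlow import cost again, which is usually small compared with training time.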
Quite possibly. "By default, TensorFlow maps nearly all of the GPU memory", so you need to make sure you configure the session correctly. https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth
per_process_gpu_memory_fraction is probably what you want: config = tf.ConfigProto(); config.gpu_options.per_process_gpu_memory_fraction = 0.4; session = tf.Session(config=config, ...)
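For reference, the two suggestions above can be written out as a short configuration sketch. It assumes the TensorFlow 1.x API that the traceback points to (under TensorFlow 2.x the same names live in `tf.compat.v1`). The constant is only a placeholder for the real model, and 0.4 is the value from the comment above, not a recommendation.

```python
import tensorflow as tf  # TensorFlow 1.x API assumed

config = tf.ConfigProto()
# Allocate GPU memory on demand instead of mapping nearly all of it up front.
config.gpu_options.allow_growth = True
# Cap this process at 40% of each visible GPU's memory.
config.gpu_options.per_process_gpu_memory_fraction = 0.4

graph = tf.Graph()
with graph.as_default():
    x = tf.constant(1.0)  # placeholder for the real model

# Pass the config where the session is created; the traceback shows
# tf.Session(graph=graph) being called without one.
with tf.Session(graph=graph, config=config) as sess:
    print(sess.run(x))
```

If another process already holds most of the GPU's memory, no per-session setting will help, since the error here is raised while the CUDA context is being created. Running `nvidia-smi` shows which processes are currently using the device.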