2017-01-23 · 114 views · 2 votes

Tensorflow: Graph is finalized and cannot be modified

I am trying to introduce fault tolerance into my program by saving variables through checkpoints, and I am trying to achieve this with MonitoredTrainingSession. Here is my setup:

import tensorflow as tf 

global_step = tf.Variable(10, trainable=False, name='global_step') 
x = tf.constant(2) 

with tf.device("/job:local/task:0"): 
    y1 = tf.Variable(x + 300) 

with tf.device("/job:local/task:1"): 
    y2 = tf.Variable(x**2) 

with tf.device("/job:local/task:2"): 
    y3 = tf.Variable(5*x) 

with tf.device("/job:local/task:3"): 
    y0 = tf.Variable(x - 66) 
    y = y0 + y1 + y2 + y3 

model = tf.global_variables_initializer() 
saver = tf.train.Saver(sharded=True) 

chief = tf.train.ChiefSessionCreator(scaffold=None, master='grpc://localhost:2222', config=None, checkpoint_dir='/home/tensorflow/codes/checkpoints') 
summary_hook = tf.train.SummarySaverHook(save_steps=None, save_secs=10, output_dir='/home/tensorflow/codes/savepoints', summary_writer=None, scaffold=None, summary_op=tf.summary.tensor_summary(name="y", tensor=y)) 
saver_hook = tf.train.CheckpointSaverHook(checkpoint_dir='/home/tensorflow/codes/checkpoints', save_secs=None, save_steps=True, saver=saver, checkpoint_basename='model.ckpt', scaffold=None) 

# with tf.train.MonitoredSession(session_creator=ChiefSessionCreator,hooks=[saver_hook, summary_hook]) as sess: 

with tf.train.MonitoredTrainingSession(master='grpc://localhost:2222', is_chief=True, checkpoint_dir='/home/tensorflow/codes/checkpoints', 
    scaffold=None, hooks=[saver_hook,summary_hook], chief_only_hooks=None, save_checkpoint_secs=None, save_summaries_steps=True, config=None) as sess: 

    while not sess.should_stop(): 
     sess.run(tf.global_variables_initializer()) 

    while not sess.should_stop(): 
     result = sess.run(y) 
     print(result) 

I get the following RuntimeError, which I cannot make sense of:

Traceback (most recent call last): 
    File "add_1.py", line 39, in <module> 
    sess.run(tf.global_variables_initializer()) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 1187, in global_variables_initializer 
    return variables_initializer(global_variables()) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 1169, in variables_initializer 
    return control_flow_ops.group(*[v.initializer for v in var_list], name=name) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2773, in group 
    deps.append(_GroupControlDeps(dev, ops_on_device[dev])) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2721, in _GroupControlDeps 
    return no_op(name=name) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_control_flow_ops.py", line 186, in no_op 
    result = _op_def_lib.apply_op("NoOp", name=name) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op 
    op_def=op_def) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2199, in create_op 
    self._check_not_finalized() 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1925, in _check_not_finalized 
    raise RuntimeError("Graph is finalized and cannot be modified.") 
RuntimeError: Graph is finalized and cannot be modified. 
http://stackoverflow.com/a/43325348/6521116 –

Answers

7 votes

The root cause of your error seems to be that MonitoredTrainingSession has already finalized (frozen) the graph, so your tf.global_variables_initializer() can no longer modify it.
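This can be reproduced in isolation. A minimal sketch (assuming TF 1.x; Graph.finalize() is what MonitoredSession calls internally before running):

```python
import tensorflow as tf

# Build a small graph, then finalize it, just as MonitoredTrainingSession
# does before handing you the session.
g = tf.Graph()
with g.as_default():
    x = tf.constant(1)
g.finalize()

# Any attempt to add a new op afterwards raises the error from the question.
try:
    with g.as_default():
        y = tf.constant(2)
    finalized_error = None
except RuntimeError as e:
    finalized_error = str(e)   # "Graph is finalized and cannot be modified."
```

This is why all graph construction, including initializer ops, must happen before the monitored session is created.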

That said, there are a couple of things worth noting:

1) Why are you trying to initialize all variables repeatedly here?

while not sess.should_stop(): 
    sess.run(tf.global_variables_initializer()) 

2) It looks like some of your code is already covered by MonitoredTrainingSession, e.g. ChiefSessionCreator. Could you take another look at the code (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/monitored_session.py#L243), or search for example usages, to see how MonitoredTrainingSession is meant to be used?


Sorry, I am very new to tensorflow, so my code is probably quite rough. 1) I have commented out the while loop above the initialization part, so it only runs once. 2) I am not sure whether ChiefSessionCreator is needed even after specifying the config in MonitoredTrainingSession. When I run it, it does print 252 in the loop. But when I stop it and run it again, it shows: http://pastebin.com/Cgk4Z9Pc – itsamineral


When you run it the second time, it tries to load the checkpoint from your earlier run, which is missing the global_step. See this thread (http://stackoverflow.com/questions/36113090/tensorflow-get-the-global-step-when-restoring-checkpoints) for how to save and restore the global_step, and here (https://github.com/tensorflow/tensorflow/blob/b00fc538638f87ac45be9105057b9865f0f9418b/tensorflow/python/training/monitored_session_test.py#L206) for how to initialize one. – guinny

1 vote

If you want to reinitialize the graph inside a loop, you can create a fresh graph at the top of the loop with these functions:

import tensorflow as tf

# discard the previous default graph before rebuilding
tf.reset_default_graph()
# or build into an explicit new graph instead:
with tf.Graph().as_default():
    pass  # construct the new graph here
0 votes

Since your goal is to use MonitoredTrainingSession to get checkpointing, the usage is much simpler than your example:

import tensorflow as tf 

global_step = tf.contrib.framework.get_or_create_global_step() 
x = tf.constant(2) 
y1 = x + 300 
y2 = x**2 
y3 = x * 5 
y0 = x - 66 
y = y0 + y1 + y2 + y3 
step = tf.assign_add(global_step, 1) 

with tf.train.MonitoredTrainingSession(checkpoint_dir='/tmp/checkpoints') as sess:
    while not sess.should_stop():
        result, i = sess.run([y, step])
        print(result, i)
  • The hooks that save/restore checkpoints are created for you by MonitoredTrainingSession.
  • You can change how often checkpoints are written from the 10-minute default by passing save_checkpoint_secs. I have found that higher frequencies are not worth it: saving a checkpoint is not free, so very frequent checkpointing ends up slowing training down.
  • ChiefSessionCreator and the gRPC config are only needed when running distributed (see here for a description of these concepts). Much like pinning ops to specific devices, make sure you really need it before using it, because it can slow you down if you are not careful.
  • You do not need to wrap the results of tensor operations in tf.Variable(): they are already tensors that can be evaluated directly.
  • You can use save_summaries_steps to monitor training with TensorBoard, but by default this already happens every 100 steps anyway.