2017-04-12

How to load a checkpoint file and continue training with a slightly different graph structure

While training my graph, I noticed that I had forgotten to add dropout to it. But I have already trained for a long time and have some checkpoints. Can I load a checkpoint, add a dropout layer, and then continue training? My code currently looks like this:

import os
import tensorflow as tf
import fcn8_vgg_ours

# create a graph
vgg_fcn = fcn8_vgg_ours.FCN8VGG()
with tf.name_scope("content_vgg"):
    vgg_fcn.build(batch_images, train=True, debug=True)
labels = tf.placeholder("int32", [None, HEIGHT, WIDTH])
# do something
...
#####
sess = tf.Session()
init_glb = tf.global_variables_initializer()
init_loc = tf.local_variables_initializer()
sess.run(init_glb)
sess.run(init_loc)
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
ckpt_dir = "./checkpoints"
if not os.path.exists(ckpt_dir):
    os.makedirs(ckpt_dir)
ckpt = tf.train.get_checkpoint_state(ckpt_dir)
start = 0
if ckpt and ckpt.model_checkpoint_path:
    # checkpoint paths look like ".../model-<epoch>", so recover the epoch
    start = int(ckpt.model_checkpoint_path.split("-")[1])
    print("start by epoch: %d" % start)
    saver = tf.train.Saver()
    saver.restore(sess, ckpt.model_checkpoint_path)
last_save_epoch = start
# continue training

So if I change the structure of FCN8VGG (add some dropout layers), will the meta file it uses then replace the graph I have just created? And if so, how can I change the structure and continue training without having to train from scratch again?


There is a tutorial on transfer learning on the official site that shows how to modify the last layer of a model, but I have found no example of adding layers; the graph_editor in contrib may be of some help –

Answer


Here is a simple example of initializing a new model's variables from another model's checkpoint. Note that things are much simpler if you can just pass a variable_scope to init_from_checkpoint, but here I assume the original model was not designed with restoring in mind.
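For reference, here is a minimal sketch of what that simpler variable_scope path could look like. The scope name "model" and the checkpoint path are hypothetical, and it assumes the checkpoint was written by a graph that used the same scope:

import tensorflow as tf

def scoped_model():
  with tf.Graph().as_default():
    fake_input = tf.constant([[1., 2., 3., 4.],
                              [5., 6., 7., 8.]])
    # Keeping all layers under one variable scope makes checkpoint names
    # predictable ("model/fully_connected/weights", and so on).
    with tf.variable_scope("model"):
      hidden = tf.contrib.layers.fully_connected(
          inputs=fake_input, num_outputs=5, activation_fn=None)
      output = tf.contrib.layers.fully_connected(
          inputs=hidden, num_outputs=1, activation_fn=None)
    # A single scope-to-scope entry restores every checkpoint variable under
    # "model/" into the identically named variables of this graph.
    # The checkpoint path here is a hypothetical example.
    tf.contrib.framework.init_from_checkpoint(
        './scoped_model_checkpoint', {'model/': 'model/'})
    init_op = tf.global_variables_initializer()
    with tf.Session() as session:
      session.run(init_op)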

First, define a simple model, some variables, and do some training:

import tensorflow as tf

def first_model():
  with tf.Graph().as_default():
    fake_input = tf.constant([[1., 2., 3., 4.],
                              [5., 6., 7., 8.]])
    layer_one_output = tf.contrib.layers.fully_connected(
        inputs=fake_input, num_outputs=5, activation_fn=None)
    layer_two_output = tf.contrib.layers.fully_connected(
        inputs=layer_one_output, num_outputs=1, activation_fn=None)
    target = tf.constant([[10.], [-3.]])
    loss = tf.reduce_sum((layer_two_output - target) ** 2)
    train_op = tf.train.AdamOptimizer(0.01).minimize(loss)
    init_op = tf.global_variables_initializer()
    saver = tf.train.Saver()
    with tf.Session() as session:
      session.run(init_op)
      for i in range(1000):
        _, evaled_loss = session.run([train_op, loss])
        if i % 100 == 0:
          print(i, evaled_loss)
      saver.save(session, './first_model_checkpoint')

Running first_model(), training looks fine and we get a first_model_checkpoint written:

0 109.432 
100 0.0812649 
200 8.97705e-07 
300 9.64064e-11 
400 9.09495e-13 
500 0.0 
600 0.0 
700 0.0 
800 0.0 
900 0.0 

Next, we can define a completely new model in a different graph and initialize the variables it shares with first_model from that checkpoint:

def second_model():
  previous_variables = [
      var_name for var_name, _
      in tf.contrib.framework.list_variables('./first_model_checkpoint')]
  with tf.Graph().as_default():
    fake_input = tf.constant([[1., 2., 3., 4.],
                              [5., 6., 7., 8.]])
    layer_one_output = tf.contrib.layers.fully_connected(
        inputs=fake_input, num_outputs=5, activation_fn=None)
    # Add a batch_norm layer, which creates some new variables. Replacing this
    # with tf.identity should verify that the model one variables are faithfully
    # restored (i.e. the loss should be the same as at the end of model_one
    # training).
    batch_norm_output = tf.contrib.layers.batch_norm(layer_one_output)
    layer_two_output = tf.contrib.layers.fully_connected(
        inputs=batch_norm_output, num_outputs=1, activation_fn=None)
    target = tf.constant([[10.], [-3.]])
    loss = tf.reduce_sum((layer_two_output - target) ** 2)
    train_op = tf.train.AdamOptimizer(0.01).minimize(loss)
    # We're done defining variables, now work on initializers. First figure out
    # which variables in the first model checkpoint map to variables in this
    # model.
    restore_map = {variable.op.name: variable
                   for variable in tf.global_variables()
                   if variable.op.name in previous_variables}
    # Set initializers for first_model variables to restore them from the
    # first_model checkpoint.
    tf.contrib.framework.init_from_checkpoint(
        './first_model_checkpoint', restore_map)
    # For new variables, global_variables_initializer will initialize them
    # normally. For variables in restore_map, they will be initialized from the
    # checkpoint.
    init_op = tf.global_variables_initializer()
    saver = tf.train.Saver()
    with tf.Session() as session:
      session.run(init_op)
      for i in range(10):
        _, evaled_loss = session.run([train_op, loss])
        print(i, evaled_loss)
      saver.save(session, './second_model_checkpoint')

In this case, previous_variables looks like:

['beta1_power', 'beta2_power', 'fully_connected/biases', 'fully_connected/biases/Adam', 'fully_connected/biases/Adam_1', 'fully_connected/weights', 'fully_connected/weights/Adam', 'fully_connected/weights/Adam_1', 'fully_connected_1/biases', 'fully_connected_1/biases/Adam', 'fully_connected_1/biases/Adam_1', 'fully_connected_1/weights', 'fully_connected_1/weights/Adam', 'fully_connected_1/weights/Adam_1'] 

Note that since we did not use any variable scopes, the naming depends on the order in which the layers are defined. If the names change, you will need to construct restore_map manually, as sketched below.
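For example, if the first layer were rebuilt under a new variable scope, its names would no longer match the checkpoint, and the map has to be written out by hand. A minimal sketch against first_model_checkpoint (the scope name "renamed" is made up for illustration):

import tensorflow as tf

def restore_renamed_layer():
  with tf.Graph().as_default():
    fake_input = tf.constant([[1., 2., 3., 4.],
                              [5., 6., 7., 8.]])
    # Same first layer as before, but under a new scope, so its variables are
    # now named "renamed/fully_connected/weights" and ".../biases".
    with tf.variable_scope("renamed"):
      layer_one_output = tf.contrib.layers.fully_connected(
          inputs=fake_input, num_outputs=5, activation_fn=None)
    variables = {v.op.name: v for v in tf.global_variables()}
    # Keys are variable names as stored in the checkpoint; values are the
    # variables in this graph that should receive them.
    restore_map = {
        'fully_connected/weights': variables['renamed/fully_connected/weights'],
        'fully_connected/biases': variables['renamed/fully_connected/biases'],
    }
    tf.contrib.framework.init_from_checkpoint(
        './first_model_checkpoint', restore_map)
    with tf.Session() as session:
      session.run(tf.global_variables_initializer())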

If we run second_model, the loss jumps up initially because the batch_norm layer has not been trained:

0 38.5976 
1 36.4033 
2 33.3588 
3 29.8555 
4 26.169 
5 22.5185 
6 19.0838 
7 16.0096 
8 13.4035 
9 11.3298 

However, replacing batch_norm with tf.identity verifies that the previously trained variables have been restored.
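Applied to the original question: tf.nn.dropout creates no variables of its own, so inserting it follows the same pattern, and restore_map then covers every variable in the new graph. A minimal sketch on the toy model above (the layer sizes come from this example, not from FCN8VGG):

import tensorflow as tf

def model_with_dropout():
  previous_variables = [
      var_name for var_name, _
      in tf.contrib.framework.list_variables('./first_model_checkpoint')]
  with tf.Graph().as_default():
    fake_input = tf.constant([[1., 2., 3., 4.],
                              [5., 6., 7., 8.]])
    layer_one_output = tf.contrib.layers.fully_connected(
        inputs=fake_input, num_outputs=5, activation_fn=None)
    # Dropout adds no variables, so every variable in this graph exists in
    # the checkpoint and the restore is exact.
    dropped_output = tf.nn.dropout(layer_one_output, keep_prob=0.5)
    layer_two_output = tf.contrib.layers.fully_connected(
        inputs=dropped_output, num_outputs=1, activation_fn=None)
    target = tf.constant([[10.], [-3.]])
    loss = tf.reduce_sum((layer_two_output - target) ** 2)
    train_op = tf.train.AdamOptimizer(0.01).minimize(loss)
    restore_map = {variable.op.name: variable
                   for variable in tf.global_variables()
                   if variable.op.name in previous_variables}
    tf.contrib.framework.init_from_checkpoint(
        './first_model_checkpoint', restore_map)
    with tf.Session() as session:
      session.run(tf.global_variables_initializer())
      for i in range(10):
        _, evaled_loss = session.run([train_op, loss])
        print(i, evaled_loss)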


Thank you very much! –
