Actor-Critic model never converges

I am trying to implement Actor-Critic with Keras & TensorFlow, but it never converges and I can't figure out why. I lowered the learning rate, but that changed nothing.

The code runs on Python 3.5.1 and TensorFlow 1.2.1.

import gym 
import itertools 
import matplotlib 
import numpy as np 
import sys 
import tensorflow as tf 
import collections 

from keras.models import Model 
from keras.layers import Input, Dense 
from keras.utils import to_categorical 
from keras import backend as K 

env = gym.make('CartPole-v0') 
NUM_STATE = env.env.observation_space.shape[0] 
NUM_ACTIONS = env.env.action_space.n 

LEARNING_RATE = 0.0005 

TARGET_AVG_REWARD = 195 

class Actor_Critic(): 

    def __init__(self): 
     l_input = Input(shape=(NUM_STATE,)) 
     l_dense = Dense(16, activation='relu')(l_input) 

     ## Policy Network 
     action_probs = Dense(NUM_ACTIONS, activation='softmax')(l_dense) 
     policy_network = Model(input=l_input, output=action_probs) 

     ## Value Network 
     state_value = Dense(1, activation='linear')(l_dense) 
     value_network = Model(input=l_input, output=state_value) 

     graph = self._build_graph(policy_network, value_network) 
     self.state, self.action, self.target, self.action_probs, self.state_value, self.minimize, self.loss = graph 

    def _build_graph(self, policy_network, value_network): 
     state = tf.placeholder(tf.float32) 
     action = tf.placeholder(tf.float32, shape=(None, NUM_ACTIONS)) 
     target = tf.placeholder(tf.float32, shape=(None)) 

     action_probs = policy_network(state) 
     state_value = value_network(state)[0] 
     advantage = tf.stop_gradient(target) - state_value 

     log_prob = tf.log(tf.reduce_sum(action_probs * action, reduction_indices=1)) 
     p_loss = -log_prob * advantage 
     v_loss = tf.reduce_mean(tf.square(advantage)) 
     loss = p_loss + (0.5 * v_loss) 

     # optimizer = tf.train.RMSPropOptimizer(LEARNING_RATE, decay=.99) 
     optimizer = tf.train.AdamOptimizer(LEARNING_RATE) 
     minimize = optimizer.minimize(loss) 

     return state, action, target, action_probs, state_value, minimize, loss, 

    def predict_policy(self, sess, state): 
     return sess.run(self.action_probs, { self.state: [state] }) 

    def predict_value(self, sess, state): 
     return sess.run(self.state_value, { self.state: [state] }) 

    def update(self, sess, state, action, target): 
     feed_dict = {self.state:[state], self.target:target, self.action:to_categorical(action, NUM_ACTIONS)} 
     _, loss = sess.run([self.minimize, self.loss], feed_dict) 
     return loss 


def train(env, sess, estimator, num_episodes, discount_factor=1.0): 

    Transition = collections.namedtuple("Transition", ["state", "action", "reward", "loss"]) 

    last_100 = np.zeros(100) 

    for i_episode in range(num_episodes): 
     # Reset the environment and pick the first action 
     state = env.reset() 

     episode = [] 

     # One step in the environment 
     for t in itertools.count(): 

      # Take a step 
      action_probs = estimator.predict_policy(sess, state)[0] 
      action = np.random.choice(np.arange(len(action_probs)), p=action_probs) 
      next_state, reward, done, _ = env.step(action) 

      target = reward + (0 if done else discount_factor * estimator.predict_value(sess, next_state)) 

      # Update our policy estimator 
      loss = estimator.update(sess, state, action, target) 

      # Keep track of the transition 
      episode.append(Transition(state=state, action=action, reward=reward, loss=loss)) 

      if done: 
       break 

      state = next_state 

     total_reward = sum(e.reward for e in episode) 
     last_100[i_episode % 100] = total_reward 
     last_100_avg = sum(last_100)/100 
     total_loss = sum(e.loss for e in episode) 
     print('episode %s loss: %f reward: %f last 100: %f' % (i_episode, total_loss, total_reward, last_100_avg)) 

     if last_100_avg >= TARGET_AVG_REWARD: 
      break 

    return 

estimator = Actor_Critic() 

with tf.Session() as sess: 
    sess.run(tf.global_variables_initializer()) 
    stats = train(env, sess, estimator, 2000, discount_factor=0.99) 

Here is the beginning of the log. (The "last 100" column is the average reward over the last 100 episodes; it automatically ramps up during the first 100 episodes, so ignore it there.)

episode 0 loss: 17.662344 reward: 15.000000 last 100: 0.150000 
episode 1 loss: 15.319713 reward: 13.000000 last 100: 0.280000 
episode 2 loss: 38.097054 reward: 32.000000 last 100: 0.600000 
episode 3 loss: 22.229492 reward: 19.000000 last 100: 0.790000 
episode 4 loss: 31.027534 reward: 26.000000 last 100: 1.050000 
episode 5 loss: 21.037663 reward: 18.000000 last 100: 1.230000 
episode 6 loss: 18.750641 reward: 16.000000 last 100: 1.390000 
episode 7 loss: 23.268227 reward: 20.000000 last 100: 1.590000 
episode 8 loss: 27.251028 reward: 23.000000 last 100: 1.820000 
episode 9 loss: 20.008078 reward: 17.000000 last 100: 1.990000 
episode 10 loss: 28.213932 reward: 24.000000 last 100: 2.230000 
episode 11 loss: 28.109922 reward: 23.000000 last 100: 2.460000 
episode 12 loss: 25.068121 reward: 21.000000 last 100: 2.670000 
episode 13 loss: 59.581238 reward: 50.000000 last 100: 3.170000 
episode 14 loss: 26.618759 reward: 22.000000 last 100: 3.390000 
episode 15 loss: 28.847467 reward: 24.000000 last 100: 3.630000 
episode 16 loss: 22.534216 reward: 17.000000 last 100: 3.800000 
episode 17 loss: 19.760979 reward: 15.000000 last 100: 3.950000 
episode 18 loss: 31.018209 reward: 25.000000 last 100: 4.200000 
episode 19 loss: 22.938683 reward: 16.000000 last 100: 4.360000 
episode 20 loss: 30.372072 reward: 24.000000 last 100: 4.600000 

After 500 episodes it not only hasn't improved, it is actually doing worse than at the start.

episode 501 loss: 97.043335 reward: 8.000000 last 100: 13.500000 
episode 502 loss: 101.957603 reward: 11.000000 last 100: 13.510000 
episode 503 loss: 100.277809 reward: 11.000000 last 100: 13.520000 
episode 504 loss: 96.754257 reward: 9.000000 last 100: 13.510000 
episode 505 loss: 99.436943 reward: 11.000000 last 100: 13.530000 
episode 506 loss: 105.161621 reward: 16.000000 last 100: 13.580000 
episode 507 loss: 65.993591 reward: 12.000000 last 100: 13.610000 
episode 508 loss: 59.837429 reward: 9.000000 last 100: 13.600000 
episode 509 loss: 92.478806 reward: 9.000000 last 100: 13.570000 
episode 510 loss: 96.697289 reward: 14.000000 last 100: 13.620000 
episode 511 loss: 94.611366 reward: 10.000000 last 100: 13.620000 
episode 512 loss: 100.259460 reward: 15.000000 last 100: 13.680000 
episode 513 loss: 88.776451 reward: 10.000000 last 100: 13.690000 
episode 514 loss: 86.659203 reward: 9.000000 last 100: 13.700000 
episode 515 loss: 105.494476 reward: 17.000000 last 100: 13.770000 
episode 516 loss: 90.662186 reward: 12.000000 last 100: 13.770000 
episode 517 loss: 90.777634 reward: 12.000000 last 100: 13.810000 
episode 518 loss: 91.290558 reward: 14.000000 last 100: 13.860000 
episode 519 loss: 94.902023 reward: 11.000000 last 100: 13.870000 
episode 520 loss: 86.746582 reward: 12.000000 last 100: 13.900000 

Plain policy gradient, on the other hand, has no problem converging:

import gym 
import itertools 
import matplotlib 
import numpy as np 
import sys 
import tensorflow as tf 
import collections 

from keras.models import Model 
from keras.layers import Input, Dense 
from keras.utils import to_categorical 
from keras import backend as K 

env = gym.make('CartPole-v0') 
NUM_STATE = env.env.observation_space.shape[0] 
NUM_ACTIONS = env.env.action_space.n 

LEARNING_RATE = 0.0005 

TARGET_AVG_REWARD = 195 

class PolicyEstimator(): 
    """ 
    Policy Function approximator. 
    """ 

    def __init__(self): 
     l_input = Input(shape=(NUM_STATE,)) 
     l_dense = Dense(16, activation='relu')(l_input) 
     action_probs = Dense(NUM_ACTIONS, activation='softmax')(l_dense) 
     model = Model(inputs=[l_input], outputs=[action_probs]) 

     self.state, self.action, self.target, self.action_probs, self.minimize, self.loss = self._build_graph(model) 

    def _build_graph(self, model): 
     state = tf.placeholder(tf.float32) 
     action = tf.placeholder(tf.float32, shape=(None, NUM_ACTIONS)) 
     target = tf.placeholder(tf.float32, shape=(None)) 

     action_probs = model(state) 

     log_prob = tf.log(tf.reduce_sum(action_probs * action, reduction_indices=1)) 
     loss = -log_prob * target 

     # optimizer = tf.train.RMSPropOptimizer(LEARNING_RATE, decay=.99) 
     optimizer = tf.train.AdamOptimizer(LEARNING_RATE) 
     minimize = optimizer.minimize(loss) 

     return state, action, target, action_probs, minimize, loss 

    def predict(self, sess, state): 
     return sess.run(self.action_probs, { self.state: [state] }) 

    def update(self, sess, state, action, target): 
     feed_dict = {self.state:[state], self.target:[target], self.action:to_categorical(action, NUM_ACTIONS)} 
     _, loss = sess.run([self.minimize, self.loss], feed_dict) 
     return loss 


def train(env, sess, estimator_policy, num_episodes, discount_factor=1.0): 

    Transition = collections.namedtuple("Transition", ["state", "action", "reward"]) 

    last_100 = np.zeros(100) 

    for i_episode in range(num_episodes): 
     # Reset the environment and pick the first action 
     state = env.reset() 

     episode = [] 

     # One step in the environment 
     for t in itertools.count(): 

      # Take a step 
      action_probs = estimator_policy.predict(sess, state)[0] 
      action = np.random.choice(np.arange(len(action_probs)), p=action_probs) 
      next_state, reward, done, _ = env.step(action) 

      # Keep track of the transition 
      episode.append(Transition(state=state, action=action, reward=reward)) 

      if done: 
       break 

      state = next_state 

     # Go through the episode and make policy updates 
     for t, transition in enumerate(episode): 
      # The return after this timestep 
      target = sum(discount_factor**i * t2.reward for i, t2 in enumerate(episode[t:])) 
      # Update our policy estimator 
      loss = estimator_policy.update(sess, transition.state, transition.action, target) 

     total_reward = sum(e.reward for e in episode) 
     last_100[i_episode % 100] = total_reward 
     last_100_avg = sum(last_100)/100 
     print('episode %s reward: %f last 100: %f' % (i_episode, total_reward, last_100_avg)) 

     if last_100_avg >= TARGET_AVG_REWARD: 
      break 

    return 

policy_estimator = PolicyEstimator() 

with tf.Session() as sess: 
    sess.run(tf.global_variables_initializer()) 
    stats = train(env, sess, policy_estimator, 2000, discount_factor=1.0) 

Reference code:

https://github.com/jaara/AI-blog/blob/master/CartPole-A3C.py

https://github.com/coreylynch/async-rl

Any help understanding this would be appreciated.

[Update]

I changed the code in _build_graph from this:

advantage = tf.stop_gradient(target) - state_value 

log_prob = tf.log(tf.reduce_sum(action_probs * action, reduction_indices=1)) 
p_loss = -log_prob * advantage 
v_loss = tf.reduce_mean(tf.square(advantage)) 
loss = p_loss + (0.5 * v_loss) 

to this:

advantage = target - state_value 

log_prob = tf.log(tf.reduce_sum(action_probs * action, reduction_indices=1)) 
p_loss = -log_prob * tf.stop_gradient(advantage) 
v_loss = 0.5 * tf.reduce_mean(tf.square(advantage)) 
loss = p_loss + v_loss 

It got better and now hits the maximum reward of 200 quite often. However, after 4000 episodes it still hasn't reached an average of 195.

Answers


The first obvious thing is the incorrect use of stop_gradient:

advantage = tf.stop_gradient(target) - state_value 

should be

advantage = target - tf.stop_gradient(state_value) 

Since there is no gradient through the target either way (it is a constant), what you want to achieve is that no gradient flows through the value network (the baseline) in the policy-gradient term. You already have a separate loss for the baseline, which looks fine.
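
A minimal sketch of that fix, reusing the question's tensor names (note that once state_value is wrapped in stop_gradient, the value loss has to be built from target - state_value directly so the baseline still receives a gradient):

# Policy term: the baseline is treated as a constant here, so this part
# only trains the policy weights.
advantage = target - tf.stop_gradient(state_value)
log_prob = tf.log(tf.reduce_sum(action_probs * action, reduction_indices=1))
p_loss = -log_prob * advantage

# Value (baseline) term: no stop_gradient, so the critic keeps learning.
v_loss = tf.square(target - state_value)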

Another possible error is the loss reduction. You explicitly call reduce_mean for v_loss but never for p_loss. Consequently the scaling is off, and your value network will probably learn more slowly (since you average over the first dimension, which is presumably the time dimension).
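
One way to keep the two terms on a comparable scale, sketched with the same names as above (the 0.5 weight is the question's own value), is to reduce both losses the same way before combining them:

# Average both terms over the batch so neither dominates purely because of shape.
p_loss = -tf.reduce_mean(log_prob * advantage)
v_loss = tf.reduce_mean(tf.square(target - state_value))
loss = p_loss + 0.5 * v_loss
minimize = tf.train.AdamOptimizer(LEARNING_RATE).minimize(loss)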


Thanks! I've updated the post. – Kenzo


Updated the answer as well. – lejlot


That shouldn't be a problem, because I update one step at a time rather than N steps as in the reference implementations, so reduce_mean only reduces over a single dimension. Sorry for the confusing code, though. – Kenzo

Another answer offered a few suggestions:

  • Negative rewards. You need to "penalize" the terminal state somehow (for example, you can hard-code the reward with if done: reward = -10); otherwise the critic will never estimate a negative value for the terminal state, and without negative values bad actions are never discouraged.
  • Minimizing the loss. You should have two minimize operations: one for the actor and one for the critic. Ideally, the critic should learn faster than the actor. (A combined sketch of both points follows this list.)
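
A rough sketch of both suggestions, assuming the two Keras models from the question are kept accessible (e.g. as self.policy_network and self.value_network); the critic's 5x larger learning rate is only an illustrative choice, not something from the answer:

# --- graph construction: one minimize op per network ---
p_loss = -tf.reduce_mean(log_prob * advantage)
v_loss = tf.reduce_mean(tf.square(target - state_value))

# Note: the shared hidden layer appears in both var_lists, so it is updated by both ops.
actor_minimize = tf.train.AdamOptimizer(LEARNING_RATE).minimize(
    p_loss, var_list=policy_network.trainable_weights)
critic_minimize = tf.train.AdamOptimizer(5 * LEARNING_RATE).minimize(
    v_loss, var_list=value_network.trainable_weights)

# --- inside the training loop: penalize reaching a terminal state ---
next_state, reward, done, _ = env.step(action)
if done:
    reward = -10  # hard-coded penalty so the critic can learn negative values

Both ops would then be run together in update(), e.g. sess.run([actor_minimize, critic_minimize], feed_dict).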