Problem porting TensorFlow code to Keras



I have been banging my head against the wall for the past few days and simply cannot figure this out. Would some kind soul tell me what I am doing wrong?

I am trying to port the code from https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Deep%20Q%20Learning/Doom/Deep%20Q%20learning%20with%20Doom.ipynb (written in TensorFlow) to Keras. Here is the relevant part of the original code:

class DQNetwork:
    def __init__(self, state_size, action_size, learning_rate, name='DQNetwork'):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        with tf.variable_scope(name):
            self.inputs_ = tf.placeholder(tf.float32, [None, *state_size], name="inputs")
            self.actions_ = tf.placeholder(tf.float32, [None, 3], name="actions_")
            self.target_Q = tf.placeholder(tf.float32, [None], name="target")

            # First convnet: CNN => BatchNormalization => ELU; Input is 84x84x4
            self.conv1 = tf.layers.conv2d(inputs = self.inputs_,
                                          filters = 32, kernel_size = [8,8], strides = [4,4], padding = "VALID",
                                          kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(), name = "conv1")
            self.conv1_batchnorm = tf.layers.batch_normalization(self.conv1, training = True,
                                                                 epsilon = 1e-5, name = 'batch_norm1')
            self.conv1_out = tf.nn.elu(self.conv1_batchnorm, name="conv1_out")
            ## --> [20, 20, 32]

            # Second convnet: CNN => BatchNormalization => ELU
            self.conv2 = tf.layers.conv2d(inputs = self.conv1_out,
                                          filters = 64, kernel_size = [4,4], strides = [2,2], padding = "VALID",
                                          kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(), name = "conv2")
            self.conv2_batchnorm = tf.layers.batch_normalization(self.conv2, training = True,
                                                                 epsilon = 1e-5, name = 'batch_norm2')
            self.conv2_out = tf.nn.elu(self.conv2_batchnorm, name="conv2_out")
            ## --> [9, 9, 64]

            # Third convnet: CNN => BatchNormalization => ELU
            self.conv3 = tf.layers.conv2d(inputs = self.conv2_out,
                                          filters = 128, kernel_size = [4,4], strides = [2,2], padding = "VALID",
                                          kernel_initializer=tf.contrib.layers.xavier_initializer_conv2d(), name = "conv3")
            self.conv3_batchnorm = tf.layers.batch_normalization(self.conv3, training = True,
                                                                 epsilon = 1e-5, name = 'batch_norm3')
            self.conv3_out = tf.nn.elu(self.conv3_batchnorm, name="conv3_out")
            ## --> [3, 3, 128]

            self.flatten = tf.layers.flatten(self.conv3_out)
            ## --> [1152]

            self.fc = tf.layers.dense(inputs = self.flatten,
                                      units = 512, activation = tf.nn.elu,
                                      kernel_initializer=tf.contrib.layers.xavier_initializer(), name="fc1")

            self.output = tf.layers.dense(inputs = self.fc, kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                          units = 3, activation=None)

            # Q is our predicted Q value.
            self.Q = tf.reduce_sum(tf.multiply(self.output, self.actions_), axis=1)

            # The loss is the difference between our predicted Q_values and the Q_target
            # Sum(Qtarget - Q)^2
            self.loss = tf.reduce_mean(tf.square(self.target_Q - self.Q))
            self.optimizer = tf.train.RMSPropOptimizer(self.learning_rate).minimize(self.loss)

# farther below...
Qs_next_state = sess.run(DQNetwork.output, feed_dict = {DQNetwork.inputs_: next_states_mb})

# Set Q_target = r if the episode ends at s+1, otherwise set Q_target = r + gamma*maxQ(s', a')
for i in range(0, len(batch)):
    terminal = dones_mb[i]
    # If we are in a terminal state, only equals reward
    if terminal:
        target_Qs_batch.append(rewards_mb[i])
    else:
        target = rewards_mb[i] + gamma * np.max(Qs_next_state[i])
        target_Qs_batch.append(target)

targets_mb = np.array([each for each in target_Qs_batch])

loss, _ = sess.run([DQNetwork.loss, DQNetwork.optimizer],
                   feed_dict={DQNetwork.inputs_: states_mb,
                              DQNetwork.target_Q: targets_mb,
                              DQNetwork.actions_: actions_mb})

And here is my conversion:


class DQNetworkA:
    def __init__(self, state_size, action_size, learning_rate):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate

        self.model = keras.models.Sequential()
        self.model.add(keras.layers.Conv2D(32, (8, 8), strides=(4, 4), padding = "VALID", input_shape=state_size))  #, kernel_initializer='glorot_normal'))
        self.model.add(keras.layers.BatchNormalization(epsilon = 1e-5))
        self.model.add(keras.layers.Activation('elu'))
        self.model.add(keras.layers.Conv2D(64, (4, 4), strides=(2, 2), padding = "VALID"))  #, kernel_initializer='glorot_normal'))
        self.model.add(keras.layers.BatchNormalization(epsilon = 1e-5))
        self.model.add(keras.layers.Activation('elu'))
        self.model.add(keras.layers.Conv2D(128, (4, 4), strides=(2, 2), padding = "VALID"))  #, kernel_initializer='glorot_normal'))
        self.model.add(keras.layers.BatchNormalization(epsilon = 1e-5))
        self.model.add(keras.layers.Activation('elu'))
        self.model.add(keras.layers.Flatten())
        self.model.add(keras.layers.Dense(512))
        self.model.add(keras.layers.Activation('elu'))
        self.model.add(keras.layers.Dense(action_size))
        self.model.compile(loss="mse", optimizer=keras.optimizers.RMSprop(lr=self.learning_rate))
        print(self.model.summary())

# farther below...
Qs = DQNetwork.predict(states_mb)
Qs_next_state = DQNetwork.predict(next_states_mb)

# Set Q_target = r if the episode ends at s+1, otherwise set Q_target = r + gamma*maxQ(s', a')
for i in range(0, len(batch)):
    terminal = dones_mb[i]
    t = np.copy(Qs[i])
    a = np.argmax(actions_mb[i])
    # If we are in a terminal state, only equals reward
    if terminal:
        t[a] = rewards_mb[i]
    else:
        t[a] = rewards_mb[i] + gamma * np.max(Qs_next_state[i])
    target_Qs_batch.append(t)
    dbg_target_Qs_batch.append(t[a])

targets_mb = np.array([each for each in target_Qs_batch])

loss = DQNetwork.train_on_batch(states_mb, targets_mb)

Everything else is the same. I even tried messing around with a custom loss function to minimize the differences between the two pieces of code, but nothing helps! While the original code converges quickly, my Keras scribble simply does not seem to want to learn!

Does anyone have an idea? Any hint or help would be greatly appreciated...

To explain a bit further:

This is a simple DQN playing Doom, so after roughly 100 episodes (games) the model is able to shoot the target in every episode without problems. The loss goes down and the reward per game goes up, just as you would expect... In the Keras model, however, the loss graph is flat and the reward graph is flat; it seems to learn almost nothing. (See the graphs linked below.)

Here is how it works. In the TF code, the model outputs a tensor [a, b, c], where a, b and c are the predicted values for each action the protagonist can take (i.e. [left, right, shoot]). The model is then rewarded per action, so it is passed a target value (target_mb, e.g. 10) together with the action that target applies to (one-hot encoded in actions_mb, i.e. [0, 1, 0] if the target is for moving right). The loss is then a simple MSE on the difference between the model's predicted value and the target value for that action.
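A small NumPy sketch of what that masking amounts to (the numbers are made up purely for illustration):

import numpy as np

# made-up batch of two states: one row of predicted action values per state
output = np.array([[3.0, 4.0, 5.0],
                   [1.0, 2.0, 0.5]])
actions = np.array([[0., 1., 0.],     # one-hot: "right" was taken
                    [0., 0., 1.]])    # one-hot: "shoot" was taken
target_Q = np.array([10.0, 7.0])      # TD targets for the taken actions

# mirrors tf.reduce_sum(tf.multiply(self.output, self.actions_), axis=1)
Q = np.sum(output * actions, axis=1)   # -> [4.0, 0.5]

# mirrors tf.reduce_mean(tf.square(self.target_Q - self.Q))
loss = np.mean((target_Q - Q) ** 2)    # -> ((10 - 4)^2 + (7 - 0.5)^2) / 2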

I did two things:

1) I tried using the standard "mse" loss, as I have seen in other models of this type. To make the loss behave the same way, I pass the model its own predictions back as the ground truth, with only the targeted entry replaced. So if the model predicts [3, 4, 5] and the target for [0, 1, 0] is 10, we pass [3, 10, 5] to the model as the ground truth. This should be equivalent to what the TF model does, i.e. the difference between 10 and 4, squared and then averaged with all the other differences in the batch. (Both variants are checked numerically in the small sketch after the custom-loss code below.)

2) When 1) did not work, I tried writing a custom loss function that mimics the behaviour of the TF model as closely as possible. So if the model predicts [3, 4, 5] and the target for [0, 1, 0] is 10 (as above), we pass [0, 10, 0] to the model as the ground truth. The custom loss function then recovers the difference between 10 and 4 through some fiddly multiplication and division, squares it and takes the mean of all the squared errors, like so:

def custom_loss(y_true, y_pred):
    isolated_truths = tf.reduce_sum(y_true, axis=1)
    isolated_predictions = tf.divide(tf.reduce_sum(tf.multiply(y_true, y_pred), axis=1), isolated_truths)
    delta = isolated_predictions - isolated_truths
    return tf.reduce_mean(tf.square(delta))

# when training, this small modification is made to targets:
loss = DQN_Keras.train_on_batch(states_mb, targets_mb.reshape(len(targets_mb), 1) * actions_mb)
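For reference, here is a quick NumPy check of what each of the two target encodings feeds into its loss for the single made-up example above (it only mirrors the per-sample formulas; both real losses additionally average over the batch):

import numpy as np

y_pred = np.array([3.0, 4.0, 5.0])    # model prediction for one state
one_hot = np.array([0.0, 1.0, 0.0])   # action taken: "right"
target = 10.0                         # TD target for that action

# variant 1: ground truth [3, 10, 5] with the built-in "mse"
# (Keras's mse averages the squared error over the output axis)
y_true_1 = np.where(one_hot == 1, target, y_pred)                  # -> [3, 10, 5]
mse_loss = np.mean((y_true_1 - y_pred) ** 2)                       # -> 36 / 3 = 12

# variant 2: ground truth [0, 10, 0] with custom_loss above
y_true_2 = one_hot * target                                        # -> [0, 10, 0]
isolated_truth = np.sum(y_true_2)                                  # -> 10
isolated_prediction = np.sum(y_true_2 * y_pred) / isolated_truth   # -> 40 / 10 = 4
custom = (isolated_prediction - isolated_truth) ** 2               # -> 36

# the original TF loss for this sample: (target_Q - Q)^2 = (10 - 4)^2 = 36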

The custom-loss version still does not work either (although you can see on the graph that the loss at least seems to behave more reasonably!).

Have a look at the graphs:

TF model: https://pasteboard.co/IN1b5MN.png

Keras model with MSE loss: https://pasteboard.co/IN1kH6P.png

Keras model with custom loss: https://pasteboard.co/IN17ktg.png

EDIT #2 - Runnable code

Original TF code, copy-pasted from the tutorial above, working: => https://pastebin.com/QLb7nWZi

My code with the fully custom loss: => https://pastebin.com/3HiYg6t7

Well, I got it working, by removing the batch normalization layers. Now I am completely baffled... So does batch normalization work differently in Keras than in TensorFlow? Or is the missing clue that mysterious "training=True" parameter in TF (which does not exist in Keras)?
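One way to test that hypothesis would be to rebuild the network with the functional API, where a BatchNormalization layer can be called with a training argument. A minimal sketch, assuming the same keras import and layer sizes as above, and assuming that forcing batch statistics (what training=True does in the TF code) is really the behaviour to reproduce:

import keras  # same import style as the snippets above

def build_model(state_size, action_size, learning_rate):
    # functional-API rewrite of DQNetworkA, with every batch-norm layer
    # forced to use batch statistics rather than its moving averages
    inputs = keras.layers.Input(shape=state_size)
    x = inputs
    for filters, kernel, stride in [(32, 8, 4), (64, 4, 2), (128, 4, 2)]:
        x = keras.layers.Conv2D(filters, kernel, strides=stride, padding="valid")(x)
        x = keras.layers.BatchNormalization(epsilon=1e-5)(x, training=True)
        x = keras.layers.Activation("elu")(x)
    x = keras.layers.Flatten()(x)
    x = keras.layers.Dense(512, activation="elu")(x)
    outputs = keras.layers.Dense(action_size)(x)
    model = keras.models.Model(inputs, outputs)
    model.compile(loss="mse", optimizer=keras.optimizers.RMSprop(lr=learning_rate))
    return model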

While digging into this, I also found this very helpful article describing how to build more advanced Keras models with multiple inputs (such as masks), just like in the original TF code!

https://becominghuman.ai/lets-build-an-atari-ai-part-1-dqn-df57e8ff3b26
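My reading of that idea, as a rough sketch (same layer sizes as above, batch-norm left out since removing it is what made things work, and the helper name build_masked_dqn is mine):

import keras

def build_masked_dqn(state_size, action_size, learning_rate):
    # the one-hot action mask is a second model input; the Q-values are
    # multiplied by it, so a plain mse loss only ever "sees" the Q-value
    # of the action that was actually taken, as in the original TF loss
    frames = keras.layers.Input(shape=state_size, name="frames")
    mask = keras.layers.Input(shape=(action_size,), name="action_mask")

    x = frames
    for filters, kernel, stride in [(32, 8, 4), (64, 4, 2), (128, 4, 2)]:
        x = keras.layers.Conv2D(filters, kernel, strides=stride,
                                padding="valid", activation="elu")(x)
    x = keras.layers.Flatten()(x)
    x = keras.layers.Dense(512, activation="elu")(x)
    q_values = keras.layers.Dense(action_size)(x)

    masked_q = keras.layers.Multiply()([q_values, mask])

    model = keras.models.Model(inputs=[frames, mask], outputs=masked_q)
    model.compile(loss="mse", optimizer=keras.optimizers.RMSprop(lr=learning_rate))
    return model

# training would then pass the one-hot actions as the mask and masked targets as y:
# model.train_on_batch([states_mb, actions_mb], targets_mb.reshape(-1, 1) * actions_mb)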

LATEST UPDATE