DQN not converging

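I am trying to solve the "LunarLander" environment from OpenAI Gym with a DQN.

After 3000 episodes of training it shows no sign of converging. (By comparison, a very simple policy gradient method converged after 2000 episodes.)

I have checked my code several times but cannot find what is wrong. I hope someone here can point out the problem. Here is my code:

I use a simple fully connected network: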

import torch.nn as nn

class Net(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # 8 state inputs -> two hidden layers of 16 units -> 4 action values
        self.main = nn.Sequential(
            nn.Linear(8, 16),
            nn.ReLU(),
            nn.Linear(16, 16),
            nn.ReLU(),
            nn.Linear(16, 4)
        )

    def forward(self, state):
        return self.main(state)

When selecting actions I use ε-greedy, with ε (starting at 0.5) decaying exponentially over time:

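def sample_action(self, state):
    # Epsilon is decayed on every call, i.e. once per environment step.
    self.epsilon = self.epsilon * 0.99
    # Despite the name, these are Q-values, one per action.
    action_probs = self.network_train(state)
    random_number = random.random()
    if random_number < (1 - self.epsilon):
        # Exploit: take the action with the highest Q-value.
        action = torch.argmax(action_probs, dim=-1).item()
    else:
        # Explore: take a uniformly random action.
        action = random.choice([0, 1, 2, 3])
    return action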
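
Because `sample_action` multiplies `self.epsilon` by 0.99 on every call, the decay runs per step rather than per episode. The sketch below only reproduces that arithmetic in plain Python. The helper name `epsilon_after` and the optional floor `eps_min` are illustrative assumptions and do not appear in the original code.

```python
def epsilon_after(steps, eps_start=0.5, decay=0.99, eps_min=0.0):
    """Epsilon after `steps` calls to sample_action.

    eps_start and decay are the values from the post; eps_min is a
    hypothetical floor (0.0 reproduces the original behaviour).
    """
    return max(eps_min, eps_start * decay ** steps)


if __name__ == "__main__":
    # Print epsilon at a few step counts to see how quickly it shrinks.
    for steps in (0, 100, 500, 1000):
        print(steps, epsilon_after(steps))
```

With the values from the post, ε drops below 0.01 after roughly 400 calls, so the agent acts almost purely greedily very early in training. Whether that matters for this particular problem is not established by the post.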

For training I use a replay buffer, a batch size of 64, and gradient clipping:

def learn(self):
    if len(self.buffer) >= BATCH_SIZE:
        self.learn_counter += 1
        # Sample a batch and transpose it into per-field tuples.
        transitions = self.buffer.sample(BATCH_SIZE)
        batch = Transition(*zip(*transitions))
        state = torch.from_numpy(np.concatenate(batch.state)).reshape(-1, 8)
        action = torch.tensor(batch.action).reshape(-1, 1)
        reward = torch.tensor(batch.reward).reshape(-1, 1)
        # Q(s, a) for the actions that were actually taken.
        state_value = self.network_train(state).gather(1, action)
        next_state = torch.from_numpy(np.concatenate(batch.next_state)).reshape(-1, 8)
        # max over a' of Q_target(s', a'), detached so no gradient flows into the target network.
        next_state_value = self.network_target(next_state).max(1)[0].reshape(-1, 1).detach()
        # TD target: reward + discount * max Q_target(s', a').
        loss = F.mse_loss(state_value.float(), (self.DISCOUNT_FACTOR*next_state_value + reward).float())
        self.optim.zero_grad()
        loss.backward()
        # Clip every gradient element to the range [-1, 1].
        for param in self.network_train.parameters():
            param.grad.data.clamp_(-1, 1)
        self.optim.step()
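
The target in `learn` bootstraps from `next_state_value` for every transition, including ones that ended an episode. A common refinement, which is not part of the original code and not the fix the author reports, is to drop the bootstrap term for terminal transitions. The sketch below shows that arithmetic with plain Python lists so it runs without PyTorch. The `Transition` tuple is not shown in the post, so the `dones` field and the function name `td_targets` are assumptions.

```python
def td_targets(rewards, next_q_max, dones, gamma):
    """One-step TD targets: r + gamma * max_a' Q_target(s', a').

    dones[i] is True when transition i ended the episode (assumed extra
    field); in that case there is no next state to bootstrap from and
    the target is just the reward.
    """
    return [
        r if done else r + gamma * q_next
        for r, q_next, done in zip(rewards, next_q_max, dones)
    ]


if __name__ == "__main__":
    # Second transition is terminal, so its target ignores next_q_max.
    print(td_targets([1.0, -100.0], [2.0, 5.0], [False, True], gamma=0.5))
```

The same expression carries over to tensors by multiplying the bootstrap term by `(1 - done)`.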

I also use a target network, whose parameters are updated every 100 time steps:

def update_network_target(self):
    if (self.learn_counter % 100) == 0:
        self.network_target.load_state_dict(self.network_train.state_dict())

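By the way, I use the Adam optimizer with a learning rate of 1e-3.

Solved. Apparently the target network was being updated too frequently. I changed it to update once every 10 episodes, and that fixed the problem.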
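
The change described above moves the target sync from a step counter to an episode counter. A minimal sketch of that scheduling follows. The names `should_sync_target`, `train`, `run_episode`, and `sync_target` are hypothetical; in the original code `sync_target` would presumably wrap the `load_state_dict` call from `update_network_target`.

```python
TARGET_UPDATE_EVERY_EPISODES = 10  # interval reported in the post


def should_sync_target(episode_index, every=TARGET_UPDATE_EVERY_EPISODES):
    """True at the end of every `every`-th episode (episode_index is 0-based)."""
    return (episode_index + 1) % every == 0


def train(num_episodes, run_episode, sync_target):
    """Run episodes and sync the target network on an episode schedule.

    run_episode(episode) and sync_target() are caller-supplied callbacks
    (hypothetical stand-ins for the agent's own methods).
    Returns the number of syncs performed.
    """
    sync_count = 0
    for episode in range(num_episodes):
        run_episode(episode)
        if should_sync_target(episode):
            sync_target()
            sync_count += 1
    return sync_count


if __name__ == "__main__":
    # Dummy callbacks: 35 episodes trigger syncs after episodes 10, 20, 30.
    print(train(35, lambda episode: None, lambda: None))
```

Since one episode spans many environment steps, this makes syncs considerably rarer than the original every-100-learning-steps rule.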
