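How should I define the state for my gridworld-like environment?

The problem I actually want to solve is not this simple, but this is a toy game to help me tackle the larger problem.

So I have a 5x5 matrix with all values equal to 0: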

structure = np.zeros(25).reshape(5, 5)

The goal is for the agent to turn all the values into 1, so I have:

goal_structure = np.ones(25).reshape(5, 5)

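I created a class Player with 5 actions: move left, right, up, or down, or flip (change a value from 0 to 1 or from 1 to 0). For the reward, if the agent changes a value from 0 to 1, it gets a +1 reward. If it changes a 1 into a 0, it gets a negative reward (I tried many values, from -1 to 0, including -0.1). If it just moves left, right, up, or down, it gets a reward of 0.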

Because I want to feed the state to my neural network, I reshape the state as follows:

reshaped_structure = np.reshape(structure, (1, 25))

Then I append the normalized position of the agent to the end of this array (because I think the agent should know where it is):

reshaped_state = np.append(reshaped_structure, (np.float64(self.x/4), np.float64(self.y/4)))
state = reshaped_state

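But I don't get any good results! It behaves as if it were random! I tried different reward functions and different optimization techniques such as experience replay, a target network, Double DQN, and Dueling DQN, but none of them seem to work! I guess the problem is in how the state is defined. Can anyone help me define a good state?

Thanks a lot!

PS: this is my step function: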

import numpy as np
from gym import spaces  # assumed source of spaces.Discrete

# The module-level grid `structure`, the action constants (right, left, up,
# down, flip) and the bounds (x_min, x_threshold, y_min, y_threshold) are
# defined elsewhere in the original script.


class Player:
    def __init__(self):
        self.x = 0
        self.y = 0
        self.max_time_step = 50
        self.time_step = 0
        self.reward_list = []
        self.sum_reward_list = []
        self.sum_rewards = []
        self.gather_positions = []
        # self.dict = {}
        self.action_space = spaces.Discrete(5)
        self.observation_space = 27  # 25 cells + 2 position values

    def get_done(self, time_step):
        # The episode ends once the step budget is used up.
        return time_step == self.max_time_step

    def flip_pixel(self):
        # Toggle the cell under the agent between 0 and 1.
        if structure[self.x][self.y] == 1:
            structure[self.x][self.y] = 0.0
        elif structure[self.x][self.y] == 0:
            structure[self.x][self.y] = 1

    def step(self, action, time_step):
        reward = 0
        # Moves are clamped to the grid and give no reward.
        if action == right:
            if self.y < y_threshold:
                self.y = self.y + 1
            else:
                self.y = y_threshold
        if action == left:
            if self.y > y_min:
                self.y = self.y - 1
            else:
                self.y = y_min
        if action == up:
            if self.x > x_min:
                self.x = self.x - 1
            else:
                self.x = x_min
        if action == down:
            if self.x < x_threshold:
                self.x = self.x + 1
            else:
                self.x = x_threshold
        if action == flip:
            self.flip_pixel()
            # +1 for setting a cell to 1, -0.1 for setting it back to 0.
            if structure[self.x][self.y] == 1:
                reward = 1
            else:
                reward = -0.1

        self.reward_list.append(reward)
        done = self.get_done(time_step)
        reshaped_structure = np.reshape(structure, (1, 25))
        reshaped_state = np.append(reshaped_structure, (np.float64(self.x/4), np.float64(self.y/4)))
        state = reshaped_state
        return state, reward, done

    def reset(self):
        # Clear the shared grid in place. A plain `structure = ...` assignment
        # here would only create a local variable and leave the grid untouched.
        structure[:] = 0.0
        reset_reshaped_structure = np.reshape(structure, (1, 25))
        reset_reshaped_state = np.append(reset_reshaped_structure, (0, 0))
        state = reset_reshaped_state
        self.x = 0
        self.y = 0
        self.reward_list = []
        self.gather_positions = []
        # self.dict.clear()
        return state

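I would encode the agent position as a matrix, like this:

0 0 0 0 0
0 0 0 0 0
0 0 1 0 0
0 0 0 0 0
0 0 0 0 0

(for the case where the agent is in the middle). Of course, you have to flatten this for the network as well. Your total state is then 50 input values: 25 for the cell states and 25 for the agent position.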

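When you encode the position as two floats, the network has to do the work of decoding the exact values of those floats. If you use an explicit scheme like the one above, it is very clear to the network exactly where the agent is. This is a "one-hot" encoding of the position.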
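To make the 50-value state concrete, here is a minimal sketch of how it could be built with NumPy. The helper name `encode_state` and the constant `GRID_SIZE` are made up for illustration; only the 5x5 grid and the (x, y) agent position come from the question.

```python
import numpy as np

GRID_SIZE = 5  # assumed to match the 5x5 grid in the question


def encode_state(structure, x, y):
    """Flatten the grid and append a one-hot map of the agent position."""
    position = np.zeros_like(structure, dtype=np.float64)
    position[x, y] = 1.0  # a single 1 marks the agent's cell
    return np.concatenate([structure.ravel(), position.ravel()])


structure = np.zeros((GRID_SIZE, GRID_SIZE))
state = encode_state(structure, 2, 2)  # agent in the middle
print(state.shape)  # (50,)
```

With this layout the first 25 inputs describe the cells and the last 25 describe where the agent is, so the network's input layer would need 50 units instead of 27.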

If you look at the Atari DQN papers, for example, the agent position is always explicitly encoded, with a neuron for each possible position.

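Note also that a very good policy for your agent is to stand still and keep flipping the cell under it, which earns 0.45 reward per step on average (+1 for 0 to 1 and -0.1 for 1 to 0, split over 2 steps). A perfect policy can only earn 25 in total, yet this policy earns 22.5 over a 50-step episode and will be very hard to unlearn. I would suggest giving the agent -1 for unsetting a cell it has already set.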
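The arithmetic behind this can be checked with a few lines of Python. This is only a sketch: the function name `flip_in_place_return` is hypothetical, and the episode length of 50 is taken from `max_time_step` in the question.

```python
def flip_in_place_return(steps, set_reward, unset_reward):
    """Total reward for toggling one cell on every step, starting from 0."""
    sets = (steps + 1) // 2   # flips that turn 0 into 1 (steps 1, 3, 5, ...)
    unsets = steps // 2       # flips that turn 1 back into 0
    return sets * set_reward + unsets * unset_reward


EPISODE_STEPS = 50      # max_time_step in the question
PERFECT_RETURN = 25.0   # one +1 per cell on the 5x5 grid

exploit_mild = flip_in_place_return(EPISODE_STEPS, 1.0, -0.1)
exploit_harsh = flip_in_place_return(EPISODE_STEPS, 1.0, -1.0)
print(exploit_mild, exploit_harsh, PERFECT_RETURN)
```

With the -0.1 penalty the stand-still policy comes within 2.5 of the perfect return, while with a -1 penalty it earns nothing, which is why the harsher penalty removes the incentive.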

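You mention that the agent is not learning. May I suggest that you simplify as much as possible? A first suggestion is to reduce the episode length to 2 or 3 steps and the grid size to 1, and see whether the agent can learn to consistently set the cell to 1. At the same time, simplify your agent's brain as much as possible: reduce it to a single output layer, that is, a linear model with an activation. This should be very quick and easy to learn. If the agent does not learn this within 100 episodes, I suspect there is a bug in your RL implementation. If it works, you can start to expand the size of the grid and the size of the network.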
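One possible shape for such a sanity check is sketched below. Everything in it is an assumption made for illustration rather than the asker's setup: a 1x1 grid with a single no-op "stay" action standing in for the four moves, 2-step episodes, the -1 penalty suggested above, and a linear Q-function over a one-hot encoding of the cell trained with plain epsilon-greedy Q-learning (no replay buffer or target network).

```python
import numpy as np

STAY, FLIP = 0, 1  # on a 1x1 grid every move is a no-op, so one "stay" action suffices
EPISODE_LENGTH = 2
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.2


def features(cell):
    """One-hot encoding of the single cell's value (0 or 1)."""
    phi = np.zeros(2)
    phi[cell] = 1.0
    return phi


def env_step(cell, action):
    """Return (next_cell, reward): +1 for setting the cell, -1 for unsetting it."""
    if action == FLIP:
        return (1, 1.0) if cell == 0 else (0, -1.0)
    return cell, 0.0


def train(episodes=100, seed=0):
    rng = np.random.default_rng(seed)
    weights = np.zeros((2, 2))  # one row of linear weights per action
    for _ in range(episodes):
        cell = 0
        for t in range(EPISODE_LENGTH):
            phi = features(cell)
            if rng.random() < EPSILON:
                action = int(rng.integers(2))            # explore
            else:
                action = int(np.argmax(weights @ phi))   # exploit
            next_cell, reward = env_step(cell, action)
            done = t == EPISODE_LENGTH - 1
            target = reward
            if not done:
                target += GAMMA * np.max(weights @ features(next_cell))
            # Gradient step on the squared TD error of the linear model.
            weights[action] += ALPHA * (target - weights[action] @ phi) * phi
            cell = next_cell
    return weights


weights = train()
greedy = [int(np.argmax(weights @ features(cell))) for cell in (0, 1)]
print(greedy)  # greedy action for cell == 0 and for cell == 1
```

If the learning loop is sound, the greedy policy should flip when the cell is 0 and stay when it is 1. If even this tiny problem does not train, the fault is more likely in the learning code than in the state definition.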
