RL +优化:如何做得更好?

我正在学习如何使用强化学习进行优化。我选择了二部图中的最大匹配问题，因为我可以很容易地计算出真正的最优。

回想一下，图中的匹配是没有两条边在同一节点/顶点上发生事件的边的子集。目标是找到最大的子集。

我在下面展示了我的完整代码，但首先让我解释一下其中的一部分。

num_variables = 1000
g = ig.Graph.Random_Bipartite(num_variables, num_variables, p=3/num_variables)
g_matching = g.maximum_bipartite_matching()
print("Matching size", len([v for v in g_matching.matching if v < num_variables and v != -1]))

这将生成一个随机二部图，在两个节点集中每个节点都有1000个节点。然后打印出真正的最大匹配值的大小。

在下面的代码中，self.agent_pos是表示找到的当前匹配的数组。它的长度是原始图的边数，如果包含i，则索引i处为1，否则为0。self.matching为生长匹配中的边的集合。self.matching_nodes是生长匹配中的节点集合，用于检查是否可以添加特定的边。

import igraph as ig
from tqdm import tqdm
import numpy as np
import gym
from gym import spaces
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
class MaxMatchEnv(gym.Env):
metadata = {'render.modes': ['console']}
def __init__(self, array_length=10):
super(MaxMatchEnv, self).__init__()
# Size of the 1D-grid
self.array_length = array_length
self.agent_pos = [0]*array_length
self.action_space = spaces.Discrete(array_length)
self.observation_space = spaces.Box(low=0, high=1, shape=(array_length,), dtype=np.uint8)
self.matching = set()  # set of edges
self.matching_nodes = set() # set of node ids (ints)
self.matching_size = len([v for v in g_matching.matching if v < num_variables and v != -1])
self.best_found = 0

def reset(self):
# Initialize the array to have random values
self.time = 0
#print(self.agent_pos)
self.agent_pos = [0]*self.array_length
self.matching = set()
self.matching_nodes = set()
return np.array(self.agent_pos)


def step(self, action):
self.time += 1 
reward = 0
edge = g.es[action]
if not(edge.source in self.matching_nodes or edge.target in self.matching_nodes):
self.matching.add(edge)
self.matching_nodes.add(edge.source)
self.matching_nodes.add(edge.target)
self.agent_pos[action] = 1
if sum(self.agent_pos) > self.best_found:
self.best_found = sum(self.agent_pos)
print("New max", self.best_found)
reward = 1
elif self.agent_pos[action] == 1:
#print("Removing edge", action)
self.matching_nodes.remove(edge.source)
self.matching_nodes.remove(edge.target)
self.matching.remove(edge)
self.agent_pos[action] = 0
reward = -1
done = sum(self.agent_pos) == self.matching_size
info = {}
return np.array(self.agent_pos), reward, done, info
def render(self, mode='console'):
print(sum(self.agent_pos))
def close(self):
pass

if __name__ == '__main__':

num_variables = 1000
g = ig.Graph.Random_Bipartite(num_variables, num_variables, p=3/num_variables)
g_matching = g.maximum_bipartite_matching()
print("Matching size", len([v for v in g_matching.matching if v < num_variables and v != -1]))
env = make_vec_env(lambda: MaxMatchEnv(array_length=len(g.es)), n_envs=12)
model = PPO('MlpPolicy', env, verbose=1).learn(10000000)

这有很多问题，但主要的一个是它没有很好地优化。这段代码给出了550多一点，然后停止改进，真正的最佳值超过900(它在开始时由代码打印出来)。

主要问题是:

如何才能做得更好，从而得到更好的匹配?

一个附属问题是，我如何打印到目前为止找到的最佳匹配?我尝试使用自我。Best_found保持最好的分数不起作用，因为它似乎被定期重置。

没有帮助的更改

改变DQN的PPO仅产生微小差异。
我试着改变代码，使done在1000步后为真。

修改如下:

if self.time == 1000:
done = True
else:
done = False

用print(max(env.get_attr("best_found")))代替print("New max", self.best_found)，对done的修改没有任何好处。

要打印每个环境的最大值，您可以从稳定基线使用get_attr方法。更多信息见官方文档。

例如，下面的行将打印12个环境中的每个环境的最大值，然后是所有环境中的最大值。

print(env.get_attr("best_found"))
print(max(env.get_attr("best_found")))

关于为什么它不收敛，这可能是由于一个糟糕的奖励选择，尽管看看你的奖励选择似乎是合理的。我在你的代码中添加了一个调试打印，看看是否有一些步骤导致done = True，但似乎环境从未达到这种状态。我认为，对于模型来说，有多个动作序列导致具有done = True的状态是很好的，这意味着模型将经历一个情节的结束。我没有详细研究你的代码中的问题，但也许这些信息可以帮助调试你的问题。

如果我们将问题与其他环境(如《CartPole》)进行比较，在那里我们有以done = True结尾的情节，这有助于模型学习更好的策略(在你的情况下，你可以限制每集的行动数量，而不是永远运行相同的情节)。这可以帮助模型避免陷入局部最优，因为你给了它"再次尝试"的机会;在新一集

如果我没看错的话，那么您没有执行任何超参数调优。您如何知道您的学习率/策略更新率/任何其他可变参数是否适合手头的问题?

看起来稳定基线已经内置了这个功能:https://stable-baselines.readthedocs.io/en/master/guide/rl_zoo.html?highlight=tune hyperparameter-optimization

相关内容

最新更新

热门标签：