How do I get the probability vector for all actions in tf-agents?

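I'm working on a multi-armed bandit problem using LinearUCBAgent and LinearThompsonSamplingAgent, but both return only a single action for an observation. What I need is the probability of every action, so that I can use it for ranking.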

You need to pass the emit_policy_info argument when defining the agent. The value (wrapped in a tuple) depends on the agent: predicted_rewards_sampled for LinearThompsonSamplingAgent, and predicted_rewards_optimistic for LinearUCBAgent.

For example:

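from tf_agents.bandits.agents.linear_thompson_sampling_agent import LinearThompsonSamplingAgent

agent = LinearThompsonSamplingAgent(
    time_step_spec=time_step_spec,
    action_spec=action_spec,
    # The trailing comma makes this a one-element tuple rather than a plain string.
    emit_policy_info=("predicted_rewards_sampled",),
)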

Then, at inference time, you need to access these fields and normalize them (via softmax):

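# action_step.info carries the fields requested through emit_policy_info.
action_step = agent.collect_policy.action(observation_step)
# Softmax over the per-action predicted rewards.
scores = tf.nn.softmax(action_step.info.predicted_rewards_sampled)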

Here tf comes from import tensorflow as tf, and observation_step is your observation array wrapped in a TimeStep (from tf_agents.trajectories.time_step import TimeStep).

Note: these are not probabilities, they are normalized scores, similar to the normalized output of a fully connected layer.
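Since the goal is ranking, here is a minimal sketch of what the softmax step does and how the resulting scores can be ordered. It uses plain Python so it runs without TensorFlow. The predicted_rewards values are made up and stand in for one row of action_step.info.predicted_rewards_sampled, and the helper names softmax and rank_actions are hypothetical, not part of tf-agents.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of floats."""
    shift = max(values)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rank_actions(values):
    """Return action indices ordered from highest to lowest score."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


# Hypothetical per-action predicted rewards for a single observation.
predicted_rewards = [0.2, 1.5, -0.3, 0.9]

scores = softmax(predicted_rewards)   # normalized scores that sum to 1
ranking = rank_actions(scores)        # best action first
print(ranking)
```

Softmax is monotonic, so ranking the normalized scores gives the same order as ranking the raw predicted rewards. If only the order matters, the softmax step can be skipped. On tensors, tf.argsort with direction="DESCENDING" should give the equivalent ordering.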
