Continuous action and observation spaces in DQN using Gym.spaces.Box



I want to study three functions and compare them against a cost function over a fixed number of time steps. My action and observation spaces are continuous. How can I fix this error?

import math
import numpy as np
import gym
from gym import spaces
from gym.spaces import Tuple, Box
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Flatten
from tensorflow.keras.optimizers import Adam
from rl.agents.dqn import DQNAgent

class Example3(gym.Env):
    def __init__(self):
        # =============================================================
        self.action_space = gym.spaces.Box(
            low=np.array([0, 0]),
            high=np.array([10, 100]),
            dtype=np.float32)
        lower_bound = np.array([-4, -1, 0], dtype=np.float32)
        upper_bound = np.array([2, 1, 10], dtype=np.float32)
        self.observation_space = spaces.Box(lower_bound, upper_bound,
                                            dtype=np.float32)

        self.time = 100
        self.state = None
        self.x = 10
        # ==================================
        self.E1report = np.array([])
        self.actionreport = np.array([])
        self.E2report = np.array([])
        self.E3report = np.array([])
        self.Costreport = np.array([])
        # ==================================

    def step(self, action):
        # ==================================
        self.actionreport = np.append(self.actionreport, action)
        np.savetxt('ActionReport.txt', self.actionreport)
        # ==================================
        E1, E2, E3 = self.state

        self.time -= 1
        # ==================================
        # equations
        #
        # E1 = (-np.sin(x)) / 2 * (x)
        # E2 = np.sin(x)
        # cost = x**2 / 10
        # ==================================

        theta = action[0] + action[1]
        # =============================================

        E1 = (-4 * np.sin(theta)) / (theta)
        E2 = np.sin(theta)
        E3 = (theta)**2 / 10
        cost = np.sin(theta) + theta + theta**2 / 2 + 2022

        # ====================================================
        self.E1report = np.append(self.E1report, E1)
        np.savetxt('E1Report.txt', self.E1report)

        self.E2report = np.append(self.E2report, E2)
        np.savetxt('E2Report.txt', self.E2report)
        self.E3report = np.append(self.E3report, E3)
        np.savetxt('E3Report.txt', self.E3report)
        self.Costreport = np.append(self.Costreport, cost)
        np.savetxt('CostReport.txt', self.Costreport)

        # ====================================================
        self.state = (E1, E2, E3)

        Myif = bool(-cost < 2025)
        if Myif:
            reward = 1
        else:
            reward = 0

        if self.time == 0:
            done = True
        else:
            done = False
        info = {}
        return np.array(self.state, dtype=np.float32), reward, done, info

    def reset(self):
        E1 = np.random.uniform(-4, 2)
        E2 = np.random.uniform(-1, 1)
        E3 = np.random.uniform(0, 10)

        self.state = (E1, E2, E3)

        self.time = 100
        done = False
        return np.array(self.state, dtype=np.float32)

    def render(self):
        pass

env = Example3()
nb_actions = env.action_space.shape[0]
# ============================
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('tanh'))
print(model.summary())
# ======================================
from rl.memory import SequentialMemory
memory = SequentialMemory(limit=20000, window_length=1)
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(),
                              attr='eps',
                              value_max=1.,
                              value_min=.1,
                              value_test=.05,
                              nb_steps=20000)
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10,
               target_model_update=100, policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])
dqn.fit(env, nb_steps=10000, visualize=False, verbose=1)

With random actions my env works (a sketch of that rollout is below), but with the DQN agent I get an error: the actions the DQN produces are 0 or 1, and then:

TypeError: 'int' object is not subscriptable
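
The random-action case looks roughly like this (an illustrative sketch, not the asker's original loop, using the classic gym API that Example3 implements):

    # Random-action sanity check: this is the case that works, because
    # action_space.sample() returns a length-2 float32 array, so action[0] is valid.
    env = Example3()
    obs = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()
        obs, reward, done, info = env.step(action)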

As the previous answer pointed out, DQN is not designed for this kind of action space, and that is exactly what the TypeError reflects (see the sketch below).
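
The mismatch can be seen from how the agent chooses actions. A minimal sketch of the selection step, assuming keras-rl's usual greedy policy (the q_values array below is an illustrative stand-in, not the library's actual internals):

    import numpy as np

    # With nb_actions = 2 network outputs, the policy takes an argmax over them,
    # i.e. it treats them as Q-values of two *discrete* actions (indices 0 and 1).
    q_values = np.array([0.2, 0.7])
    action = int(np.argmax(q_values))   # -> 1, a plain Python int (hence actions 0 or 1)

    # Example3.step() then evaluates action[0] + action[1]:
    # action[0]                         # TypeError: 'int' object is not subscriptable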

However, there are agents that handle continuous action spaces out of the box; the Stable Baselines 3 library, for example, provides several. A good overview of which agents are suitable for continuous action spaces is given here: https://stable-baselines3.readthedocs.io/en/master/guide/algos.html
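
For example, a minimal training sketch with SAC, which supports Box action spaces natively. This assumes a Stable Baselines 3 version below 2.0, which still accepts the classic gym API used by Example3 (SB3 >= 2.0 expects gymnasium envs):

    from stable_baselines3 import SAC

    env = Example3()

    # Optional diagnostic: check_env(env) from stable_baselines3.common.env_checker
    # would flag that step() can return observations outside the declared bounds
    # (e.g. E3 = theta**2 / 10 can exceed the upper bound of 10), which is worth
    # fixing regardless of which agent is used.

    model = SAC("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=10000)

SB3 will also warn that this Box action space is not normalized; rescaling the two actions to [-1, 1] and mapping them back inside step() is the usual convention.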
