Building a reward function for an OpenAI RL environment for raw material purchasing



I'm experimenting with deep reinforcement learning and created the following environment, which runs a simulation of purchasing raw materials. The starting quantity is the amount of material I have on hand over the 12-week purchase horizon (sim_weeks). I must buy in multiples of 195000 lbs, and I expect to use 45000 lbs of material each week.

import numpy as np
from gym import Env
from gym.spaces import Discrete, Box

start_qty = 100000       # lbs on hand at the start of the simulation
sim_weeks = 12           # weeks in the purchase horizon
purchase_mult = 195000   # purchases must be in multiples of this (lbs)
forecast_qty = 45000     # forecast weekly usage (lbs)

class ResinEnv(Env):
    def __init__(self):
        # Actions we can take: 0 = buy nothing, 1 = buy one lot of purchase_mult
        self.action_space = Discrete(2)
        # Quantity on hand can swing widely, so use a generous box
        self.observation_space = Box(low=np.array([-1000000]), high=np.array([1000000]))
        # Set starting quantity
        self.state = start_qty
        # Weeks left in the purchase horizon
        self.purchase_length = sim_weeks

    def step(self, action):
        # Consume a week's worth of material; this gives the
        # quantity available at the end of the week
        self.state -= forecast_qty

        # Apply the purchase, if any
        self.state += action * purchase_mult

        # Days on hand: quantity divided by daily usage
        days = self.state / (forecast_qty / 7)

        # Reduce weeks left to purchase by one
        self.purchase_length -= 1

        # Reward is the negative of days on hand, with a large
        # penalty for running out of material
        if self.state < 0:
            reward = -10000
        else:
            reward = -days

        # Episode ends when the purchase horizon is exhausted
        done = self.purchase_length <= 0

        # Placeholder for info
        info = {}

        # Return step information
        return self.state, reward, done, info

    def render(self):
        # Visualization not implemented
        pass

    def reset(self):
        # Reset quantity and weeks remaining
        self.state = start_qty
        self.purchase_length = sim_weeks
        return self.state
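To sanity-check the dynamics, the step logic above can be exercised with a random policy. The sketch below mirrors the same update rules in plain Python (no gym dependency) so it runs standalone; the seed and the random policy are illustrative choices, not part of the original environment.

```python
import random

random.seed(42)  # illustrative seed, for a reproducible run

start_qty, sim_weeks = 100000, 12
purchase_mult, forecast_qty = 195000, 45000

state, total_reward = start_qty, 0.0
for week in range(sim_weeks):
    action = random.randint(0, 1)              # 0 = buy nothing, 1 = buy one lot
    state -= forecast_qty                      # consume a week's worth of material
    state += action * purchase_mult            # receive the purchase, if any
    days_on_hand = state / (forecast_qty / 7)  # inventory in days of coverage
    reward = -10000 if state < 0 else -days_on_hand
    total_reward += reward

print(total_reward)
```

Since every per-step reward is non-positive (and the first week always ends with positive stock, earning a negative reward), the episode total is always negative.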

I'm debating whether the reward function is adequate. What I'm trying to do is minimize the sum of days on hand across all steps, where days on hand for a given step is the `days` value computed in the code. Since the goal is to maximize the reward function, I decided I could negate the days-on-hand value and use that negative number as the reward (so maximizing reward minimizes days on hand). I then added a large penalty for letting the quantity available go negative in any week.
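As a quick numeric check of that scheme (the state values below are hypothetical, not taken from a real run):

```python
forecast_qty = 45000
daily_usage = forecast_qty / 7   # ~6428.6 lbs/day

def reward(state):
    # Large fixed penalty for a stock-out, else negative days on hand
    return -10000 if state < 0 else -(state / daily_usage)

print(reward(55000))   # about -8.56: ~8.56 days of coverage
print(reward(-5000))   # stock-out penalty: -10000
```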

Is there a better way to do this? I'm new to this topic, and new to Python. Any advice is much appreciated!

I think you should consider reducing the scale of the reward. See here and here on stabilizing training in neural networks. If the RL agent's only task is to minimize days on hand, then the reward scheme makes sense. It just needs a bit of normalization!
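One way to do that normalization is to scale the per-step reward into roughly [-1, 0]. A minimal sketch, assuming an upper bound `max_days` on plausible days of coverage (this bound is my own assumption, derived from the starting stock plus a couple of purchase lots; it is not from the original code):

```python
forecast_qty = 45000
purchase_mult = 195000
start_qty = 100000

# Assumed cap on days of coverage: starting stock plus two purchase lots
max_days = (start_qty + 2 * purchase_mult) / (forecast_qty / 7)

def normalized_reward(state):
    if state < 0:
        return -1.0                        # stock-out: worst possible reward
    days = state / (forecast_qty / 7)
    return -min(days / max_days, 1.0)      # scale days on hand into [0, 1]
```

Keeping rewards in a narrow, consistent range avoids very large gradient updates (e.g. the -10000 penalty dwarfing the per-step days-on-hand signal) and tends to make value-function training more stable.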
