我的问题在某个区间(0,1(上有一个单一的状态和无限多的操作。在谷歌上搜索了很长一段时间后,我发现了一些关于缩放算法的论文,它可以解决具有连续动作空间的问题。然而,我的实现不善于利用。因此,我正在考虑添加一种epsilon贪婪行为。
把不同的方法结合起来合理吗?
你知道解决我问题的其他方法吗?
代码示例:
import portion as P
def choose_action(self, i_ph):
# Activation rule
not_covered = P.closed(lower=0, upper=1)
for arm in self.active_arms:
confidence_radius = calc_confidence_radius(i_ph, arm)
confidence_interval = P.closed(arm.norm_value - confidence_radius, arm.norm_value + confidence_radius)
not_covered = not_covered - confidence_interval
if not_covered != P.empty():
rans = []
height = 0
heights = []
for i in not_covered:
rans.append(np.random.uniform(i.lower, i.upper))
height += i.upper - i.lower
heights.append(i.upper - i.lower)
ran_n = np.random.uniform(0, height)
j = 0
ran = 0
for i in range(len(heights)):
if j < ran_n < j + heights[i]:
ran = rans[i]
j += heights[i]
self.active_arms.append(Arm(len(self.active_arms), ran * (self.sigma_square - lower) + lower, ran))
# Selection rule
max_index = float('-inf')
max_index_arm = None
for arm in self.active_arms:
confidence_radius = calc_confidence_radius(i_ph, arm)
# indexfunction from zooming algorithm
index = arm.avg_learning_reward + 2 * confidence_radius
if index > max_index:
max_index = index
max_index_arm = arm
action = max_index_arm.value
self.current_arm = max_index_arm
return action
def learn(self, action, reward):
arm = self.current_arm
arm.avg_reward = (arm.pulled * arm.avg_reward + reward) / (arm.pulled + 1)
if reward > self.max_profit:
self.max_profit = reward
elif reward < self.min_profit:
self.min_profit = reward
# normalize reward to [0, 1]
high = 100
low = -75
if reward >= high:
reward = 1
self.high_count += 1
elif reward <= low:
reward = 0
self.low_count += 1
else:
reward = (reward - low)/(high - low)
arm.avg_learning_reward = (arm.pulled * arm.avg_learning_reward + reward) / (arm.pulled + 1)
arm.pulled += 1
# zooming algorithm confidence radius
def calc_confidence_radius(i_ph, arm: Arm):
return math.sqrt((8 * i_ph)/(1 + arm.pulled))
您可以在这里找到这个有用的完整算法描述。他们均匀地排列探针,告知这种选择(例如,正常地以著名的高能臂为中心(也是可能的(但这可能会使我不确定的一些界限无效(。