How to handle agent updates when deep reinforcement learning produces a multi-dimensional action space

When writing a custom deep reinforcement learning environment, you sometimes need an agent whose action space is multi-dimensional.

For example, suppose the environment we design is a brick-breaking (Breakout-style) game. The agent needs to produce a probability distribution over the actions [left, right, stay], so the action space has only one dimension, e.g. [0.2, 0.4, 0.4].
Now suppose the game needs two paddles to hit the bricks while still being controlled by a single agent. The action space the agent produces then becomes multi-dimensional, with action probabilities like this: [[0.2, 0.4, 0.4], [0.2, 0.4, 0.4]]
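To make the two cases concrete, here is a minimal sketch built on torch.distributions.Categorical; the probability values come from the example above, everything else is purely illustrative:

import torch
from torch.distributions import Categorical

# One paddle: a single distribution over [left, right, stay]
probs_1d = torch.tensor([0.2, 0.4, 0.4])
dist_1d = Categorical(probs=probs_1d)
a_1d = dist_1d.sample()                  # scalar action, e.g. tensor(1)

# Two paddles controlled by one agent: one distribution per paddle
probs_2d = torch.tensor([[0.2, 0.4, 0.4],
                         [0.2, 0.4, 0.4]])
dist_2d = Categorical(probs=probs_2d)    # batch_shape = [2]
a_2d = dist_2d.sample()                  # one action per paddle, e.g. tensor([1, 2])
print(dist_2d.log_prob(a_2d).shape)      # torch.Size([2]): one log-prob per action dimension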

So how does the agent's action probability distribution change in this case, and what problem does it cause during the update?

When updating with the PPO algorithm, the agent first interacts with the environment to accumulate experience, and then learns from it.
During the update, the advantages are computed from the stored experience first; once they are ready, minibatches are sampled and the agent is trained. The code looks like this:

    def update(self, replay_buffer, total_steps):
        s, a, a_logprob, r, s_, dw, done = replay_buffer.numpy_to_tensor()  # load the stored experience from the replay buffer
        adv = []
        gae = 0
        with torch.no_grad():  # adv and v_target have no gradient
            vs = self.critic(s)
            vs_ = self.critic(s_)
            deltas = r + self.gamma * (1.0 - dw) * vs_ - vs
            for delta, d in zip(reversed(deltas.flatten().numpy()), reversed(done.flatten().numpy())):
                gae = delta + self.gamma * self.lamda * gae * (1.0 - d)
                adv.insert(0, gae)
            adv = torch.tensor(adv, dtype=torch.float).view(-1, 1)
            v_target = adv + vs
            if self.use_adv_norm:  # Trick 1:advantage normalization
                adv = ((adv - adv.mean()) / (adv.std() + 1e-5))

        # Optimize policy for K epochs:
        for _ in range(self.K_epochs):
            # Random sampling and no repetition. 'False' indicates that training will continue even if the number of samples in the last time is less than mini_batch_size
            for index in BatchSampler(SubsetRandomSampler(range(self.batch_size)), self.mini_batch_size, False):
                dist_now = Categorical(probs=self.actor(s[index]))
                dist_entropy = dist_now.entropy().view(-1, 1)  # shape(mini_batch_size X 1)
                a_logprob_now = dist_now.log_prob(a[index].squeeze()).view(-1, 1)  # shape(mini_batch_size X 1)

                ratios = torch.exp(a_logprob_now - a_logprob[index])  # with a multi-dimensional action space, a_logprob_now here is a [128, 1] tensor while a_logprob[index] is a [64, 1] tensor
                surr1 = ratios * adv[index]  # Only calculate the gradient of 'a_logprob_now' in ratios
                surr2 = torch.clamp(ratios, 1 - self.epsilon, 1 + self.epsilon) * adv[index]
                actor_loss = -torch.min(surr1, surr2) - self.entropy_coef * dist_entropy  # shape(mini_batch_size X 1)

                # Update actor
                self.optimizer_actor.zero_grad()
                actor_loss.mean().backward()
                if self.use_grad_clip:  # Trick 7: Gradient clip
                    torch.nn.utils.clip_grad_norm_(self.actor.parameters(), 0.5)
                self.optimizer_actor.step()

                v_s = self.critic(s[index])
                critic_loss = F.mse_loss(v_target[index], v_s)

                # Update critic
                self.optimizer_critic.zero_grad()
                critic_loss.backward()
                if self.use_grad_clip:  # Trick 7: Gradient clip
                    torch.nn.utils.clip_grad_norm_(self.critic.parameters(), 0.5)
                self.optimizer_critic.step()

        if self.use_lr_decay:  # Trick 6:learning rate Decay
            self.lr_decay(total_steps)

When the action space is multi-dimensional, the update step builds the probability ratio by feeding the sampled states into the policy network to obtain the new policy, while the log-probabilities stored in the buffer serve as the old policy.
At this point, however, the old policy tensor only has shape [batch_size, 1] (the experience was stored in advance, so the sampled slice has the same size as batch_size), while the newly produced policy tensor has shape [batch_size*2, 1], which raises the following error:

RuntimeError: The size of tensor a (batch_size*2) must match the size of tensor b (batch_size) at non-singleton dimension 0
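The mismatch is easy to reproduce in isolation; the shapes below follow the comment in the code above (a minibatch of 64 states, two action dimensions, three discrete actions) and are assumptions for illustration only:

import torch
from torch.distributions import Categorical

probs = torch.softmax(torch.randn(64, 2, 3), dim=-1)   # actor output: 64 states, 2 paddles, 3 actions
dist_now = Categorical(probs=probs)                    # batch_shape = [64, 2]
a = dist_now.sample()                                  # shape [64, 2]
a_logprob_now = dist_now.log_prob(a).view(-1, 1)       # shape [128, 1]
a_logprob_old = torch.zeros(64, 1)                     # stands in for the stored old log-probs
ratios = torch.exp(a_logprob_now - a_logprob_old)      # RuntimeError: size mismatch at dimension 0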

In summary, a multi-dimensional action space causes a size mismatch between the old and new policies, and forcibly reshaping the new policy down to [batch_size, 1] is not an option either, because it would scramble the data.
So the question is how to keep the multi-dimensional structure of the action space while keeping the old and new policy tensors the same size, and the answer is to use a joint distribution.
Representing the multi-dimensional action space as a joint distribution, i.e. treating the two-dimensional action as a single point of one joint distribution, solves the problem above.
PyTorch already ships a built-in wrapper for this: torch.distributions.Independent.
A short sketch of how Independent changes the shapes is shown right below, followed by the full modified update function.
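A minimal sketch of the shape change, continuing the illustrative 64 x 2 x 3 tensors from the snippet above:

import torch
from torch.distributions import Categorical, Independent

probs = torch.softmax(torch.randn(64, 2, 3), dim=-1)    # 64 states, 2 paddles, 3 actions
dist = Categorical(probs=probs)                         # batch_shape = [64, 2]
joint = Independent(dist, reinterpreted_batch_ndims=1)  # batch_shape = [64], event_shape = [2]
a = joint.sample()                                      # shape [64, 2]
print(joint.log_prob(a).shape)                          # torch.Size([64]): one joint log-prob per state
print(joint.entropy().shape)                            # torch.Size([64]): per-dimension entropies summed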

    def update(self, replay_buffer, total_steps):
        s, a, a_logprob, r, s_, dw, done = replay_buffer.numpy_to_tensor()  # Get training data
        a_logprob = torch.sum(a_logprob, dim=1, keepdim=True)               # combine the stored per-dimension log-probs into one joint log-prob per transition
        # Calculate advantages using GAE
        adv = []
        gae = 0
        with torch.no_grad():
            vs = self.critic(s)
            vs_ = self.critic(s_)
            deltas = r + self.gamma * (1.0 - dw) * vs_ - vs
            for delta, d in zip(reversed(deltas.flatten().numpy()), reversed(done.flatten().numpy())):
                gae = delta + self.gamma * self.lamda * gae * (1.0 - d)
                adv.insert(0, gae)
            adv = torch.tensor(adv, dtype=torch.float).view(-1, 1)
            v_target = adv + vs
            if self.use_adv_norm:  # Trick 1: advantage normalization
                adv = (adv - adv.mean()) / (adv.std() + 1e-5)

        # Optimize policy for K epochs:
        for _ in range(self.K_epochs):
            for index in BatchSampler(SubsetRandomSampler(range(self.batch_size)), self.mini_batch_size, False):
                action_mean_now = self.actor(s[index])
                dist_now = Categorical(probs=action_mean_now)
                independent_dist = Independent(dist_now, reinterpreted_batch_ndims=1)  # join the per-dimension distributions into one joint distribution over the multi-dimensional action
                entropy = independent_dist.entropy().mean()                            # entropy of the joint distribution, averaged over the minibatch

                a_logprob_now = independent_dist.log_prob(a[index].squeeze()).view(-1, 1)  # new joint log-prob, shape [mini_batch_size, 1]
                a_logprob_old = a_logprob[index].view(-1, 1)                               # stored joint log-prob, same shape, so the ratio lines up

                ratios = torch.exp(a_logprob_now - a_logprob_old)
                surr1 = ratios * adv[index]
                surr2 = torch.clamp(ratios, 1 - self.epsilon, 1 + self.epsilon) * adv[index]
                actor_loss = -torch.min(surr1, surr2) - self.entropy_coef * entropy

                self.optimizer_actor.zero_grad()
                actor_loss.mean().backward()
                if self.use_grad_clip:  # Trick 7: Gradient clip
                    torch.nn.utils.clip_grad_norm_(self.actor.parameters(), 0.5)
                self.optimizer_actor.step()

                v_s = self.critic(s[index])
                critic_loss = F.mse_loss(v_target[index], v_s)

                self.optimizer_critic.zero_grad()
                critic_loss.backward()
                if self.use_grad_clip:  # Trick 7: Gradient clip
                    torch.nn.utils.clip_grad_norm_(self.critic.parameters(), 0.5)
                self.optimizer_critic.step()

        if self.use_lr_decay:  # Trick 6: learning rate Decay
            self.lr_decay(total_steps)

The changes above resolve the size mismatch between the old and new action probability tensors when updating an agent with a multi-dimensional action space.
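One final point worth noting: the torch.sum(a_logprob, dim=1, keepdim=True) at the top of update() assumes that the replay buffer stores one log-probability per action dimension at rollout time. A possible rollout-side counterpart is sketched below; the method name, the actor interface and the return convention are assumptions for illustration, not the original code:

    def choose_action(self, s):
        # Hypothetical rollout-side counterpart: sample every action dimension at once and
        # return the per-dimension log-probs that update() later sums into a joint log-prob.
        s = torch.unsqueeze(torch.tensor(s, dtype=torch.float), 0)  # add a batch dimension
        with torch.no_grad():
            probs = self.actor(s)                # assumed shape [1, 2, 3]: 2 paddles, 3 actions each
            dist = Categorical(probs=probs)
            a = dist.sample()                    # shape [1, 2]: one discrete action per dimension
            a_logprob = dist.log_prob(a)         # shape [1, 2]: one log-prob per dimension
        return a.numpy().flatten(), a_logprob.numpy().flatten()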

posted @ 2024-08-18 17:20  Wonx3