Reinforcement Learning: Why AC Algorithms Don't Use the Q-Function to Represent the Advantage Function
That policy gradient (PG) methods in reinforcement learning should not plug the Q-function directly into the gradient estimate goes back to Sutton's proof of the policy gradient theorem: subtracting a state-dependent baseline leaves the gradient unbiased while reducing its variance, which is why the advantage function is used instead. Actor-critic (AC) methods, the most common family of PG algorithms, all rely on this, and PPO in particular, one of the best-performing RL algorithms in practice, is very widely used. Yet these algorithms all compute the advantage A from the state-value function V rather than from the action-value function Q. Why is that? Some explanations are given below.
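As a reminder (standard definitions, added here for context rather than taken from the original post), the advantage compares an action against the policy's baseline value at that state, and its one-step estimate needs only a learned V, never Q:

```latex
A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t),
\qquad
\hat{A}_t = \underbrace{r_t + \gamma V(s_{t+1}) - V(s_t)}_{\delta_t \;(\text{TD residual})}
```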
--------------------------------------------------
From the GAE paper, "High-Dimensional Continuous Control Using Generalized Advantage Estimation":

1. First, the state-value function has a lower-dimensional input and is thus easier to learn than a state-action value function.
2. Second, the method of this paper allows us to smoothly interpolate between the high-bias estimator (λ = 0) and the low-bias estimator (λ = 1) [see the code sketch after this list].
3. On the other hand, using a parameterized Q-function only allows us to use a high-bias estimator. We have found that the bias is prohibitively large when using a one-step estimate of the returns. We expect that similar difficulty would be encountered when using an advantage estimator involving a parameterized Q-function.
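To make point 2 concrete, here is a minimal sketch of the GAE recursion from the paper, Â_t = δ_t + γλ Â_{t+1} with δ_t = r_t + γV(s_{t+1}) − V(s_t). Only the state-value critic V appears anywhere in the computation; the function name compute_gae and the toy inputs are illustrative assumptions, not from the paper or the original post.

```python
import numpy as np

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (illustrative sketch).

    rewards:    r_0 .. r_{T-1} for one trajectory segment
    values:     V(s_0) .. V(s_{T-1}) from the learned state-value critic
    last_value: V(s_T), bootstrap value for the state after the segment
    """
    T = len(rewards)
    values = np.append(values, last_value)  # append V(s_T) for bootstrapping
    advantages = np.zeros(T)
    gae = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Toy usage: a 5-step segment with made-up rewards and critic outputs
rewards = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
values = np.array([0.5, 0.4, 0.6, 0.3, 0.5])
print(compute_gae(rewards, values, last_value=0.2, lam=0.0))  # pure TD residuals
print(compute_gae(rewards, values, last_value=0.2, lam=1.0))  # MC return minus V
```

With lam=0 each advantage collapses to the one-step TD residual (high bias, low variance); with lam=1 it becomes the discounted return minus the V baseline (low bias, high variance); intermediate values trade the two off. Note the sketch assumes a single non-terminating segment and omits done-flag handling.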
======================
Original paper: J. Schulman, P. Moritz, S. Levine, M. Jordan, P. Abbeel. High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438.

posted on 2023-11-29 11:39 by Angry_Panda