强化学习:AC算法中为什么不使用Q函数来表示优势函数

强化学习中的策略梯度法(PG)不直接使用Q函数作为值函数来进行计算已经在Sutton的PG公式证明中提出,主要作用就是减少方差,因此使用优势函数进行计算。作为PG算法类中最常见的AC类算法有着较多的使用,尤其是PPO算法作为目前效果最好的强化学习算法更有着广泛使用,但是这些算法都是使用状态值函数V来进行计算优势函数A,而没有使用动作值函数Q来计算优势函数,那么为什么呢,这里给出了些解释。

 

 

 

--------------------------------------------------

 

 

《High-Dimensional Continuous Control Using Generalized Advantage Estimation》

 

 

 

 

 

1.  First, the state-value function has a lower-dimensional input and is thus easier to learn than a state-action value function.

2.  Second, the method of this paper allows us to smoothly interpolate between the high-bias estimator (λ = 0) and the low-bias estimator (λ = 1).

3.  On the other hand, using a parameterized Q-function only allows us to use a high-bias estimator. We have found that the bias is prohibitively large when using a one-step estimate of the returns. We expect that similar difficulty would be encountered when using an advantage estimator involving a parameterized Q-function.

 

 

======================

 

原论文:

 

posted on 2023-11-29 11:39  Angry_Panda  阅读(119)  评论(0)    收藏  举报

导航