Reinforcement Learning: Why AC Algorithms Don't Use the Q-Function to Represent the Advantage Function
That policy gradient (PG) methods in reinforcement learning should not plug the Q-function directly into the gradient estimate goes back to Sutton's proof of the policy gradient theorem: subtracting a state-dependent baseline leaves the gradient unbiased while reducing its variance, which is why the advantage function is used instead. Actor-critic (AC) methods, the most common family of PG algorithms, all rely on this, and PPO in particular, one of the best-performing RL algorithms in practice, is very widely used. Yet these algorithms all compute the advantage A from the state-value function V rather than from the action-value function Q. Why is that? Some explanations are given below.
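As a reminder (standard definitions, added here for context rather than taken from the original post), the advantage compares an action against the policy's baseline value at that state, and its one-step estimate needs only a learned V, never Q:

```latex
A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t),
\qquad
\hat{A}_t = \underbrace{r_t + \gamma V(s_{t+1}) - V(s_t)}_{\delta_t \;(\text{TD residual})}
```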
--------------------------------------------------
From the GAE paper, "High-Dimensional Continuous Control Using Generalized Advantage Estimation":

1. First, the state-value function has a lower-dimensional input and is thus easier to learn than a state-action value function.
2. Second, the method of this paper allows us to smoothly interpolate between the high-bias estimator (λ = 0) and the low-bias estimator (λ = 1) [see the code sketch after this list].
3. On the other hand, using a parameterized Q-function only allows us to use a high-bias estimator. We have found that the bias is prohibitively large when using a one-step estimate of the returns. We expect that similar difficulty would be encountered when using an advantage estimator involving a parameterized Q-function.
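To make point 2 concrete, here is a minimal sketch of the GAE recursion from the paper, Â_t = δ_t + γλ Â_{t+1} with δ_t = r_t + γV(s_{t+1}) − V(s_t). Only the state-value critic V appears anywhere in the computation; the function name compute_gae and the toy inputs are illustrative assumptions, not from the paper or the original post.

```python
import numpy as np

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (illustrative sketch).

    rewards:    r_0 .. r_{T-1} for one trajectory segment
    values:     V(s_0) .. V(s_{T-1}) from the learned state-value critic
    last_value: V(s_T), bootstrap value for the state after the segment
    """
    T = len(rewards)
    values = np.append(values, last_value)  # append V(s_T) for bootstrapping
    advantages = np.zeros(T)
    gae = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Toy usage: a 5-step segment with made-up rewards and critic outputs
rewards = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
values = np.array([0.5, 0.4, 0.6, 0.3, 0.5])
print(compute_gae(rewards, values, last_value=0.2, lam=0.0))  # pure TD residuals
print(compute_gae(rewards, values, last_value=0.2, lam=1.0))  # MC return minus V
```

With lam=0 each advantage collapses to the one-step TD residual (high bias, low variance); with lam=1 it becomes the discounted return minus the V baseline (low bias, high variance); intermediate values trade the two off. Note the sketch assumes a single non-terminating segment and omits done-flag handling.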
======================
Original paper: J. Schulman, P. Moritz, S. Levine, M. Jordan, P. Abbeel. High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438.

posted on 2023-11-29 11:39 by Angry_Panda