Why is the mainstream framework for reinforcement learning algorithms actor-critic, rather than merging the actor and the critic into a critic-only framework that generates the policy automatically from the Q-values via a softmax?

Related:

https://www.reddit.com/r/reinforcementlearning/comments/uqsh86/whats_the_point_of_the_actor_in_actor_critic/








What's the point of the actor in actor-critic

Since the actor seems to be dictated by the critic, I'm not quite sure what the point of the actor is. Of course the actor acts out the action, but could you not just let the critic decide the action, for example by taking the highest Q-value, or by converting the Q-values to probabilities if you wanted a stochastic approach?
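For concreteness, the approach the questioner has in mind is the classic Boltzmann (softmax) policy over Q-values. Below is a minimal illustrative sketch, not taken from any particular paper or library; the `q_values` and the `temperature` parameter are made-up placeholders.

```python
import numpy as np

def boltzmann_policy(q_values: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Turn Q-values into action probabilities via a softmax (Boltzmann) distribution."""
    # Subtract the max before exponentiating for numerical stability.
    z = (q_values - q_values.max()) / temperature
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Made-up Q-values for a single state with three actions.
q = np.array([1.0, 1.1, 0.5])
print(boltzmann_policy(q, temperature=1.0))   # fairly smooth distribution
print(boltzmann_policy(q, temperature=0.1))   # concentrates on the greedy action
```

The temperature controls how sharply the policy commits to the greedy action; taking the highest Q-value is simply the zero-temperature limit of this distribution.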




Answer:

This question looks big and intimidating, but the answer is actually simple: algorithms that represent the policy as a softmax over Q-values (the Boltzmann, or softmax, policy) have existed for a very long time. The direction saw no further development because the performance was poor, so there was no point in pursuing it. The reason is that small changes to the Q-values during optimization can cause huge changes to the resulting policy, which degrades both convergence and final performance; in practice this approach ends up even less stable, and weaker, than plain DQN.
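To see why this coupling is fragile, consider how sensitive the softmax policy is to its Q-values, especially at low temperatures. The sketch below is illustrative only: it repeats the hypothetical `boltzmann_policy` helper from the earlier sketch and applies a small, made-up perturbation of the kind a single TD update might produce.

```python
import numpy as np

def boltzmann_policy(q_values, temperature=1.0):
    # Same hypothetical helper as above, repeated so this sketch runs standalone.
    z = (q_values - q_values.max()) / temperature
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Two nearly identical Q estimates: the second differs by one small, TD-sized update.
q_before = np.array([1.00, 1.02, 0.50])
q_after = q_before + np.array([0.05, 0.00, 0.00])

for tau in (1.0, 0.1, 0.01):
    p_before = boltzmann_policy(q_before, tau)
    p_after = boltzmann_policy(q_after, tau)
    # Total variation distance: half the L1 difference between the two policies.
    shift = 0.5 * np.abs(p_before - p_after).sum()
    print(f"temperature={tau}: policy shift = {shift:.3f}")
```

At temperature 1.0 the policy barely moves, but as the temperature shrinks the same 0.05 nudge flips the greedy action and shifts most of the probability mass. An actor-critic method avoids this by giving the policy its own parameters and updating them with small gradient steps, so noise in the critic's value estimates does not translate directly into abrupt policy changes.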






