Post archive for March 19, 2020 - yingfengwu

March 19, 2020

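Summary: TD learning (temporal-difference learning) is central to RL. TD learning usually refers to the value-function estimation methods introduced by Sutton (1988). Q-learning (Watkins and Dayan, 1992) and SARSA (Rummery and Niranjan, 1994) are temporal-difference control methods. TD lea… Read more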
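To make the difference between the two control methods named above concrete, here is a minimal sketch of their one-step updates. It assumes a tabular Q stored as a dict of dicts; the function names, the step size alpha, and the discount gamma are illustrative choices, not taken from the post. Q-learning bootstraps from the largest action value in the next state, while SARSA bootstraps from the action actually chosen next.

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy TD control: bootstrap from the max action value in s_next."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])  # move Q(s, a) toward the TD target


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy TD control: bootstrap from the action actually taken in s_next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


# Hypothetical two-state table, used only for illustration.
Q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 1.0, "right": 2.0},
}
q_learning_update(Q, "s0", "left", r=1.0, s_next="s1", alpha=0.5, gamma=0.9)
sarsa_update(Q, "s0", "right", r=1.0, s_next="s1", a_next="left", alpha=0.5, gamma=0.9)
```

In both cases the quantity `target - Q[s][a]` is the TD error; only the choice of bootstrap value differs.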

posted @ 2020-03-19 11:46 yingfengwu Views (283) Comments (0) Recommendations (0)

Explanation of Exploration and Exploitation

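Summary: An RL agent must trade off exploration of uncertain policies against exploitation of the current policy. The agent chooses a greedy parameter ε in the range (0, 1), usually close to 0. In the current state s, it takes the greedy action with probability 1 − ε and a random action with probability ε. That is, with probability 1 − ε the agent exploits the current best… Read more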
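The ε-greedy rule in this summary can be sketched in a few lines. This is a minimal illustration, assuming the action-value estimates for the current state are given as a plain list; the function name `epsilon_greedy` and its parameters are hypothetical and not from the post.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index given action-value estimates for the current state."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random (probability epsilon).
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest estimate (probability 1 - epsilon).
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Illustrative call: with a small epsilon the greedy action is chosen most of the time.
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1)
```

Note that the random branch can also land on the greedy action, so the greedy action is selected slightly more often than 1 − ε overall.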

posted @ 2020-03-19 11:23 yingfengwu Views (1061) Comments (0) Recommendations (0)

Bellman Equation

Summary: http://www.atyun.com/10331.html Read more

posted @ 2020-03-19 09:55 yingfengwu Views (184) Comments (0) Recommendations (0)
