Hugh_Cai - cnblogs (博客园)

Summary: Dictum: Books are the ever-burning lamps of accumulated wisdom. -- G. W. Curtis

posted @ 2020-01-22 22:22 Hugh_Cai Views (107) Comments (0) Recommendations (1)

November 25, 2020

3. Distributional Reinforcement Learning with Quantile Regression

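Summary: In theory, the C51 algorithm uses the Wasserstein metric to measure the distance between two cumulative distribution functions, which establishes the feasibility of the value-distribution approach. In the practical algorithm, however, it fits probabilities over a discrete support using the KL divergence, which does not act on cumulative distribution functions and does not guarantee that the Bellman update converges. C51 also builds the value distribution from a fixed set of discrete support points and adjusts only their probabilities. Quantile regression (q…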

posted @ 2020-11-25 18:47 Hugh_Cai Views (544) Comments (0) Recommendations (1)

2. A Distributional Perspective on Reinforcement Learning

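Summary: This post studies distributional reinforcement learning, which uses the idea of a value distribution to obtain the probability distribution of the return $Z$ in place of its expected value (the $Q$ value). Q-Learning: the goal of Q-Learning is to approximate the Q function, that is, the expected value of the return $Z_t$ under policy $\pi$: \(Q^{\pi}(s,a)=\mathbb…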

posted @ 2020-11-25 18:46 Hugh_Cai Views (339) Comments (1) Recommendations (1)

1. Deep Q-Learning

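Summary: Traditional reinforcement learning algorithms are strong at decision-making but hard to apply to tasks in high-dimensional spaces. Combining them with the strong perception capability of deep learning gave rise to deep reinforcement learning, whose most classic example is DQN (Deep Q-Network). DQN 2013: the main idea of DQN is to train a CNN to fit the Q-Learning algorithm, so that an agent in a complex RL environment learns from raw video data…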

posted @ 2020-11-25 18:45 Hugh_Cai Views (396) Comments (0) Recommendations (1)

Ⅶ. Policy Gradient Methods

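Summary: Dictum: Life is just a series of trying to make up your mind. -- T. Fuller. Unlike value-based RL methods, which approximate a value function and compute a deterministic policy from it, policy-based RL methods turn policy learning from a set of probabilities $P(a|s)$ into a policy function $\pi(a|…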

posted @ 2020-11-25 16:07 Hugh_Cai Views (322) Comments (0) Recommendations (1)

November 4, 2020

Ⅴ. Temporal-Difference Learning

Summary: Dictum: Although the world is full of suffering, it is full also of the overcoming of it. -- Helen Keller. Temporal-Difference Learning (TD) combines D…

posted @ 2020-11-04 21:46 Hugh_Cai Views (166) Comments (0) Recommendations (0)

April 30, 2020

Ⅳ. Monte Carlo Methods

Summary: Dictum: Nutrition books in the world. There is no book in life, there is no sunlight; wisdom without books, as if the birds do not have wings. -- Shak…

posted @ 2020-04-30 20:58 Hugh_Cai Views (394) Comments (0) Recommendations (1)

April 24, 2020

Ⅲ. Dynamic Programming

Summary: Dictum: A man who is willing to be a slave, who does not know the power of freedom. -- Beck. Dynamic Programming (DP) is a model-based method: given a complete environment model described by an MDP…

posted @ 2020-04-24 23:05 Hugh_Cai Views (238) Comments (0) Recommendations (1)

April 12, 2020

Ⅱ. Finite Markov Decision Processes

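Summary: Dictum: Is the true wisdom fortitude ambition. -- Napoleon. Markov Decision Processes (MDPs) are a tool for solving sequential decision problems, in which the decision maker interacts with the environment step by step. The "agent-environment" interaction…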

posted @ 2020-04-12 23:13 Hugh_Cai Views (512) Comments (0) Recommendations (1)

April 10, 2020

Ⅰ. Introduction to Reinforcement Learning

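Summary: Dictum: To spark, often burst in hard stone. -- William Liebknecht. Reinforcement Learning imitates the way humans learn (for example, learning a new skill, from first steps to mastery, is a matter of repeatedly finding mistakes and correcting them until the skill is fully mastered). The main… of reinforcement learning…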

posted @ 2020-04-10 13:27 Hugh_Cai Views (538) Comments (0) Recommendations (1)

Cai_Blog

Nothing is easier to succeed than learning.
