[Mathematical Foundations of Reinforcement Learning] Course Notes (1): Basic Concepts
Note: these are my personal study notes, organized from Prof. Zhao Shiyu's course "Mathematical Foundations of Reinforcement Learning" (Westlake University) on Bilibili. Many thanks to the instructor for sharing such a clearly explained course.
1. Introduction
Of the recommended reference books, the first two are mostly descriptive introductions, while the latter two are more mathematical and harder to follow.
Supervised and unsupervised learning are mainly used for classification and regression; reinforcement learning focuses on decision making.

2. Basic Concepts

State: the status of the agent with respect to the environment. In this example (a 3×3 grid world), there are nine possible states.
State space: the set of all states, $\mathcal{S} = \{s_i\}_{i=1}^{9}$.
Action: the actions that can be taken at each state; in this example, every state has 5 possible actions.
Action space of a state: the set of actions available at that state, $\mathcal{A}(s_i) = \{a_i\}$.
State transition: taking an action moves the agent from one state to another, e.g. $s_1 \overset{a_2}{\longrightarrow} s_2$.
Forbidden area: either accessible but with a penalty, or inaccessible.
Tabular representation: state transitions written as a table; this form can only describe deterministic transitions (see the sketch below).
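A minimal Python sketch of these sets and of a tabular (deterministic) transition, assuming the usual 3×3 grid-world example; the concrete state/action names, action meanings, and table entries are illustrative, not taken from the lecture.

```python
# Illustrative grid-world sets (names and entries are assumptions).
states = [f"s{i}" for i in range(1, 10)]    # state space S = {s1, ..., s9}
actions = ["a1", "a2", "a3", "a4", "a5"]    # A(s): up, right, down, left, stay (assumed ordering)

# Tabular (deterministic) state transitions: each (state, action) has exactly one next state.
transition_table = {
    ("s1", "a2"): "s2",   # s1 --a2--> s2
    ("s1", "a5"): "s1",   # staying keeps the agent where it is
}

print(len(states), len(actions))            # 9 states, 5 actions per state
print(transition_table[("s1", "a2")])       # -> "s2"
```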

State transition probability: state transitions expressed as conditional probabilities, e.g. $p(s_2 \mid s_1, a_2) = 1$; this describes deterministic as well as stochastic transitions.
Policy: tells the agent which action to take at each state. Intuitively a policy can be drawn with arrows; mathematically it is a conditional probability, e.g. $\pi(a_1 \mid s_1) = 0,\quad \pi(a_2 \mid s_1) = 1$, which can equally represent stochastic (non-deterministic) policies. A policy can also be written in tabular form, sketched below.
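As a rough illustration (not the course's own code), both $p(s' \mid s, a)$ and a policy $\pi(a \mid s)$ can be stored as nested dictionaries and sampled from. All concrete probabilities here are made up, except $p(s_2 \mid s_1, a_2) = 1$ and the deterministic policy entry at $s_1$ from the text above.

```python
import random

# State transition probabilities p(s'|s, a): a distribution over next states per (s, a).
# A deterministic transition is just a distribution putting probability 1 on one state.
p_transition = {
    ("s1", "a2"): {"s2": 1.0},              # p(s2 | s1, a2) = 1 (deterministic)
    ("s1", "a3"): {"s4": 0.8, "s1": 0.2},   # a made-up stochastic transition
}

# Policy pi(a|s): a distribution over actions per state.
policy = {
    "s1": {"a1": 0.0, "a2": 1.0, "a3": 0.0, "a4": 0.0, "a5": 0.0},  # deterministic at s1
    "s2": {"a1": 0.0, "a2": 0.5, "a3": 0.5, "a4": 0.0, "a5": 0.0},  # stochastic (made up)
}

def sample(dist):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

a = sample(policy["s1"])                  # always "a2" at s1
s_next = sample(p_transition[("s1", a)])  # "s2" with probability 1
print(a, s_next)
```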

Reward: a scalar obtained when the agent takes an action. A positive value encourages the behaviour, a negative value punishes it, and 0 means no judgement, which to some extent also acts as encouragement.
- Reward can be interpreted as a human-machine interface, with which we can guide the agent to behave as we expect.
- Tabular representation: each (state, action) pair is assigned a reward.
- Conditional-probability representation: e.g. $p(r=1 \mid s_1, a_1) = 1$.
- The reward depends on the current state and action, not on the next state. For example, taking $a_1$ or $a_5$ at $s_1$ leads to the same next state $s_1$, but the rewards differ: $a_1$ tries to cross the boundary and is penalised, while $a_5$ stays in place and receives no penalty (see the sketch below).
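A small sketch of the reward distribution $p(r \mid s, a)$ for the two cases just mentioned. The reward values (-1 for hitting the boundary, 0 for staying) are assumed for illustration, as is the helper name `expected_reward`.

```python
# Reward distributions p(r | s, a): the reward depends only on the current state and action,
# not on the next state. Reward values below are assumed for illustration.
p_reward = {
    ("s1", "a1"): {-1: 1.0},  # a1 tries to cross the boundary: penalised, even though the agent stays at s1
    ("s1", "a5"): {0: 1.0},   # a5 stays in place: no penalty, same next state s1
}

def expected_reward(state, action):
    """E[r | s, a] = sum over r of r * p(r | s, a)."""
    return sum(r * p for r, p in p_reward[(state, action)].items())

print(expected_reward("s1", "a1"))  # -> -1.0
print(expected_reward("s1", "a5"))  # ->  0.0
```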
Trajectory: a state-action-reward chain, e.g. $s_1 \xrightarrow[r=0]{a_2} s_2 \xrightarrow[r=0]{a_3} s_5 \xrightarrow[r=0]{a_3} s_8 \xrightarrow[r=1]{a_2} s_9$.
Return: the return of a trajectory is the sum of all the rewards collected along it; for the trajectory above, $\text{return} = 0 + 0 + 0 + 1 = 1$.
Discounted return:
- discount rate: $\gamma \in [0, 1)$
- discounted return: $\text{return} = 0 + \gamma \cdot 0 + \gamma^2 \cdot 0 + \gamma^3 \cdot 1 = \gamma^3$
- When $\gamma$ is close to 0, the return is dominated by the rewards obtained early on, so the agent cares more about near-term rewards; when $\gamma$ is close to 1, the return is dominated by the accumulated rewards, so the agent cares more about long-term rewards.
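A quick numerical check of the two definitions for the trajectory above (a sketch; the helper name `discounted_return` is mine, and $\gamma = 0.9$ is just an example value).

```python
# Rewards collected along the trajectory s1 -> s2 -> s5 -> s8 -> s9.
rewards = [0, 0, 0, 1]

def discounted_return(rewards, gamma=1.0):
    """Return sum_t gamma^t * r_{t+1}; gamma = 1 recovers the plain (undiscounted) return."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return(rewards))             # 1               (undiscounted return)
print(discounted_return(rewards, gamma=0.9))  # 0.9^3 = 0.729   (discounted return)
```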
Episode: when interacting with the environment following a policy, the agent may stop at some terminal states. The resulting trajectory is called an episode (or a trial).
Continuing tasks: there are no terminal states, so the interaction never ends.
Episodic tasks can be converted into continuing tasks so that both are treated in a unified way. Option 1: after entering the target state the agent never leaves it, and all subsequent rewards are set to 0 (an absorbing state). Option 2: the target state is treated as an ordinary state; following the policy, the agent may stay there (and keep collecting rewards) or leave. This course adopts option 2, as sketched below.
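For intuition only (a sketch assuming the target state keeps yielding reward $r = 1$ while the agent stays there, a typical choice in this kind of grid world): under option 2 the trajectory never terminates, yet with $\gamma < 1$ the discounted return remains finite, since $\sum_t \gamma^t = 1/(1-\gamma)$.

```python
# Option 2: the target is an ordinary state, so a policy that stays there keeps collecting
# reward r = 1 forever. With gamma < 1 the discounted return of this never-ending
# trajectory converges to 1 / (1 - gamma).  (r = 1 at the target is an assumed value.)
gamma = 0.9
truncated = sum(gamma ** t * 1 for t in range(1000))  # truncate the infinite sum
print(truncated, 1 / (1 - gamma))                     # both approximately 10.0
```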
3. MDP
3.1 Key elements of an MDP
- Sets
  - State: the set of states $\mathcal{S}$
  - Action: the set of actions $\mathcal{A}(s)$ associated with each state $s \in \mathcal{S}$
  - Reward: the set of rewards $\mathcal{R}(s, a)$
- Probability distributions
  - State transition probability: at state $s$, taking action $a$, the probability of transiting to state $s'$ is $p(s' \mid s, a)$
  - Reward probability: at state $s$, taking action $a$, the probability of getting reward $r$ is $p(r \mid s, a)$
- Policy: at state $s$, the probability of choosing action $a$ is $\pi(a \mid s)$
- Markov property (the memoryless property):
  $$p(s_{t+1} \mid a_{t+1}, s_t, \cdots, a_1, s_0) = p(s_{t+1} \mid a_{t+1}, s_t)$$
  $$p(r_{t+1} \mid a_{t+1}, s_t, \cdots, a_1, s_0) = p(r_{t+1} \mid a_{t+1}, s_t)$$
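Putting the pieces together, a minimal illustrative MDP sketch (not from the course): the Markov property shows up in the fact that `step` only needs the current state and action, never the earlier history. All names and the sample entries are assumptions.

```python
import random

def sample(dist):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

class MDP:
    """Bundles p(s'|s,a) and p(r|s,a); together with a policy pi(a|s) it generates trajectories."""
    def __init__(self, p_transition, p_reward):
        self.p_transition = p_transition  # (s, a) -> {s': probability}
        self.p_reward = p_reward          # (s, a) -> {r: probability}

    def step(self, s, a):
        # Markov property: the next state and reward depend only on (s, a).
        return sample(self.p_transition[(s, a)]), sample(self.p_reward[(s, a)])

# Tiny made-up example: the deterministic transition s1 --a2--> s2 with reward 0.
mdp = MDP({("s1", "a2"): {"s2": 1.0}}, {("s1", "a2"): {0: 1.0}})
print(mdp.step("s1", "a2"))  # ('s2', 0)
```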
3.2 Example

