【强化学习的数学原理】(Mathematical Foundations of Reinforcement Learning) Course Notes (4): Value Iteration and Policy Iteration
Note: these are personal study notes, organized from the 【强化学习的数学原理】 course by Prof. Shiyu Zhao (赵世钰) of Westlake University on Bilibili. Many thanks to the professor for sharing such a clearly explained course.
1. Value Iteration
$$v_{k+1} = f(v_k) = \max\limits_{\pi}\left(r_{\pi}+\gamma P_{\pi}v_{k}\right),\quad k=1,2,3,\cdots$$
The algorithm can be split into two steps. The matrix-vector form is suited to theoretical analysis, while the element-wise form is suited to implementation, so each step is given first in matrix-vector form and then in element-wise form:
- Step 1: policy update

  $$\pi_{k+1} = \arg\max\limits_{\pi}\left(r_\pi + \gamma P_\pi v_k\right)$$

  where $v_k$ is given. Element-wise form:

  $$\pi_{k+1}(s) = \arg\max\limits_{\pi} \sum\limits_{a}\pi(a|s) \underbrace{\left( \sum\limits_{r}p(r|s,a)\,r+\gamma \sum\limits_{s'}p(s'|s, a)\,v_{k}(s')\right)}_{q_k(s,a)},\quad s \in S$$

  The solution is the deterministic policy

  $$\pi_{k+1}(a|s) = \begin{cases} 1 & a=a^*_{k}(s)\\ 0 & a \neq a^*_{k}(s) \end{cases}$$

  where $a^*_{k}(s)=\arg\max\limits_{a}q_{k}(s, a)$. $\pi_{k+1}$ is called the greedy policy, since it selects the action with the largest q-value.
- Step 2: value update

  $$v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k$$

  Element-wise form:

  $$v_{k+1}(s) =\sum\limits_{a}\pi_{k+1}(a|s) \left( \sum\limits_{r}p(r|s,a)\,r+\gamma \sum\limits_{s'}p(s'|s, a)\,v_{k}(s')\right)=\max\limits_{a}q_k(s,a)$$
Note: $v_k$ is not a state value; it is only an intermediate quantity of the iteration and does not satisfy a Bellman equation.
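To make the two steps concrete, here is a minimal sketch of tabular value iteration. It assumes a small MDP described by two NumPy arrays, `P[s, a, s'] = p(s'|s, a)` and `R[s, a] = Σ_r p(r|s, a) r`; these names, shapes, and the stopping tolerance are my own choices for illustration, not part of the lecture.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Tabular value iteration (hypothetical array layout).

    P: transition probabilities, shape (S, A, S), P[s, a, s'] = p(s'|s, a).
    R: expected immediate rewards, shape (S, A), R[s, a] = sum_r p(r|s, a) * r.
    Returns the (approximately) optimal value vector and the greedy policy.
    """
    num_states = P.shape[0]
    v = np.zeros(num_states)            # v_0 can be chosen arbitrarily
    while True:
        q = R + gamma * P @ v           # q_k(s, a), shape (S, A)
        v_new = q.max(axis=1)           # value update: v_{k+1}(s) = max_a q_k(s, a)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    q = R + gamma * P @ v
    policy = q.argmax(axis=1)           # policy update: deterministic greedy policy
    return v, policy
```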
Example: the lecture illustrates the iteration with a worked example (presented as figures, not reproduced here).
2. Policy Iteration
Algorithm description: given a random initial policy $\pi_0$:
- Step 1: policy evaluation (PE)

  Compute the state value of $\pi_k$ (solved iteratively; see the notes on the Bellman equation):

  $$v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$$

- Step 2: policy improvement (PI)

  $$\pi_{k+1} = \arg \max\limits_{\pi}\left(r_\pi + \gamma P_\pi v_{\pi_k}\right)$$
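For comparison, a minimal sketch of policy iteration under the same hypothetical `P`/`R` array layout as in the value-iteration sketch above; the inner loop carries out policy evaluation iteratively, as described in the Bellman equation notes.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-6):
    """Tabular policy iteration with iterative policy evaluation (PE)."""
    num_states = P.shape[0]
    states = np.arange(num_states)
    policy = np.zeros(num_states, dtype=int)      # pi_0: arbitrary deterministic policy
    while True:
        # Step 1: policy evaluation, iterate v <- r_pi + gamma * P_pi v until convergence
        r_pi = R[states, policy]                  # r_pi(s) = R[s, pi(s)]
        P_pi = P[states, policy]                  # P_pi[s, s'] = p(s'|s, pi(s))
        v = np.zeros(num_states)
        while True:
            v_new = r_pi + gamma * P_pi @ v
            if np.max(np.abs(v_new - v)) < tol:
                v = v_new
                break
            v = v_new
        # Step 2: policy improvement, greedy policy w.r.t. q_{pi_k}
        q = R + gamma * P @ v
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return v, policy
        policy = new_policy
```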
3. Truncated Policy Iteration
3.1 Policy Iteration and Value Iteration
| Policy Iteration: start from $\pi_0$ | Value Iteration: start from $v_0$ |
|---|---|
| PE: $v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$ | PU: $\pi_{k+1} = \arg\max\limits_{\pi}(r_\pi + \gamma P_\pi v_k)$ |
| PI: $\pi_{k+1} = \arg\max\limits_{\pi}(r_\pi + \gamma P_\pi v_{\pi_k})$ | VU: $v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k$ |
3.2 Truncated Policy Iteration
In the PE step of Policy Iteration, suppose the fixed-point update for $v_{\pi_k}$ is run for only $j$ iterations. When $j = 1$ the algorithm is exactly Value Iteration; when $j \rightarrow \infty$ it is Policy Iteration; for any other value of $j$ (the iterate has not converged to $v_{\pi_k}$ and is therefore written $v_k$) the algorithm is called Truncated Policy Iteration. In other words, Policy Iteration and Value Iteration are special cases of Truncated Policy Iteration. The algorithm is sketched below.