Convergence Analysis Tools for Reinforcement Learning (1)
Introduction
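This series collects and studies tools that are commonly used in convergence analysis for reinforcement learning. Organizing these tools helps sharpen one's sensitivity to the structure of the arguments and build the related intuition.
The first part mainly follows this article and fills in some proof steps that it omits.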
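Notation
Throughout, we consider a discounted MDP with state space \(\mathcal{S}\), transition kernel \(P(\cdot|s,a)\), reward \(r(s,a)\) with \(r_t = r(s_t,a_t)\), discount factor \(\gamma\in[0,1)\), and initial state distribution \(\mu\). For a policy \(\pi\), the value function, action-value function, and advantage function are defined as: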


\[\begin{align}
V^{\pi}(s) & := \mathbb{E}[\sum_{t=0}^{\infty}\gamma^t r_t | s_0=s,\pi] \tag{1}\\
Q^{\pi}(s,a) & := \mathbb{E}[\sum_{t=0}^{\infty}\gamma^t r_t | s_0=s,a_0=a,\pi] \tag{2}\\
A^{\pi}(s,a) & := Q^{\pi}(s,a)-V^{\pi}(s) \tag{3}
\end{align}
\]
To simplify the notation, we also define:
\[\begin{align}
V^{\pi}({\mu}) & := \mathbb{E}_{s\sim\mu}[V^{\pi}(s)]\tag{4}\\
V(\pi) & := V^{\pi}({\mu}) \tag{5}\\
V(\theta) & := V(\pi_{\theta}) \tag{6}
\end{align}
\]
The goal of reinforcement learning can therefore be stated as:
\[\max_{\pi}V^{\pi}(\mu)
\]
It is also easy to see how \(V\) and \(Q\) are related:
\[V^{\pi}(s) = \mathbb{E}_{a\sim\pi(\cdot|s)}[Q^{\pi}(s,a)]
\]
From the definitions above, we obtain the following basic recursions:
\[\begin{aligned}
V^{\pi}(s) &= \mathbb{E}_{a\sim\pi(\cdot|s)}[r(s,a)+\mathbb{E}_{s'\sim P(\cdot|s,a)}\mathbb{E}[\sum_{t=1}^{\infty}\gamma^t r_t | s_1=s',\pi]] \\
& = \mathbb{E}_{a\sim\pi(\cdot|s)}[r(s,a)+\gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[V^{\pi}(s')] ] \\
Q^{\pi}(s,a) &= r(s,a)+\gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[V^{\pi}(s')]
\end{aligned}
\]
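The recursions above can be checked numerically on a small tabular example. The following is a minimal sketch in plain Python, assuming a hypothetical 2-state, 2-action MDP (the arrays `P`, `R`, `PI` and the value of `GAMMA` are illustrative choices, not taken from the referenced article). It iterates the recursion for \(V^{\pi}\) until it is numerically fixed and then measures how well \(V^{\pi}(s) = \mathbb{E}_{a\sim\pi(\cdot|s)}[Q^{\pi}(s,a)]\) holds.

```python
# Hypothetical 2-state, 2-action MDP; every number here is an illustrative assumption.
GAMMA = 0.9
P = [  # P[s][a][s'] = transition probability P(s' | s, a)
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.3, 0.7]],
]
R = [[1.0, 0.0], [0.5, 2.0]]   # R[s][a] = r(s, a)
PI = [[0.6, 0.4], [0.2, 0.8]]  # PI[s][a] = pi(a | s)
STATES, ACTIONS = range(2), range(2)


def q_from_v(v):
    """Q(s, a) = r(s, a) + gamma * E_{s' ~ P(.|s,a)}[V(s')]."""
    return [[R[s][a] + GAMMA * sum(P[s][a][t] * v[t] for t in STATES)
             for a in ACTIONS] for s in STATES]


def evaluate(pi, iters=2000):
    """Repeat V <- E_{a ~ pi}[Q(s, a)]; the error shrinks like gamma^iters."""
    v = [0.0 for _ in STATES]
    for _ in range(iters):
        q = q_from_v(v)
        v = [sum(pi[s][a] * q[s][a] for a in ACTIONS) for s in STATES]
    return v


V_PI = evaluate(PI)
Q_PI = q_from_v(V_PI)

# Largest violation of V(s) = E_{a ~ pi}[Q(s, a)] over all states.
bellman_residual = max(
    abs(V_PI[s] - sum(PI[s][a] * Q_PI[s][a] for a in ACTIONS)) for s in STATES
)
print(V_PI, bellman_residual)
```

Fixed-point iteration is used here only because it mirrors the recursion directly; solving the linear system for \(V^{\pi}\) would work equally well on an example this small.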
Performance Difference Lemma
The Performance Difference Lemma states that:
\[\begin{aligned}
V(\tilde{\pi}) - V(\pi)&=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\tilde{\pi}}_{\mu}}([\mathcal{T}^{\tilde{\pi}}V^{\pi}-V^{\pi}](s))\\
&= \frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\tilde{\pi}}_{\mu}}\mathbb{E}_{a\sim \tilde{\pi}(\cdot |s)}[A^{\pi}(s,a)]
\end{aligned}
\]
where \(\displaystyle d_{\mu}^{\tilde{\pi}}(s)=\mathbb{E}_{s_0\sim\mu}[(1-\gamma)\sum_{t=0}^{+\infty}\gamma^t \mathbb{P}^{\tilde{\pi}}_t(s|s_0)]\) and
\[\begin{aligned}
(\mathcal{T}^{\tilde{\pi}}V^{\pi})(s) & :=\mathbb{E}_{a\sim \tilde{\pi}(\cdot |s),s'\sim P(\cdot |s,a)}[r(s,a)+\gamma V^{\pi}(s')]\\
&=\mathbb{E}_{a\sim \tilde{\pi}(\cdot |s)}[Q^{\pi}(s,a)]
\end{aligned}
\]
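The Performance Difference Lemma describes the gap between the value functions \(V\) of two policies \(\pi\) and \(\tilde{\pi}\): it equals the advantage of \(\pi\), averaged over the states and actions visited by \(\tilde{\pi}\).
For a detailed proof, see these lecture notes. The proof proceeds in four steps: split the difference into two terms \(\to\) unroll it into a recursion \(\to\) solve for the general term \(\to\) construct the new probability distribution \(d^{\tilde{\pi}}_{\mu}\). The lecture notes explain every step clearly except the last one.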
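As a sanity check, the identity can be evaluated numerically. The sketch below assumes a hypothetical 2-state, 2-action MDP in plain Python (`P`, `R`, `MU`, `PI`, `PI_NEW`, and `GAMMA` are made-up illustrative values, with `PI_NEW` playing the role of \(\tilde{\pi}\)). It computes both sides of the lemma, with \(d^{\tilde{\pi}}_{\mu}\) approximated by truncating the infinite sum at a long horizon.

```python
# Hypothetical 2-state, 2-action MDP; every number here is an illustrative assumption.
GAMMA = 0.9
P = [[[0.8, 0.2], [0.1, 0.9]], [[0.5, 0.5], [0.3, 0.7]]]  # P[s][a][s']
R = [[1.0, 0.0], [0.5, 2.0]]       # R[s][a] = r(s, a)
MU = [0.3, 0.7]                    # initial distribution mu
PI = [[0.6, 0.4], [0.2, 0.8]]      # pi(a | s)
PI_NEW = [[0.1, 0.9], [0.7, 0.3]]  # tilde-pi(a | s)
STATES, ACTIONS = range(2), range(2)


def q_from_v(v):
    """Q(s, a) = r(s, a) + gamma * E_{s'}[V(s')]."""
    return [[R[s][a] + GAMMA * sum(P[s][a][t] * v[t] for t in STATES)
             for a in ACTIONS] for s in STATES]


def evaluate(pi, iters=2000):
    """Iterative policy evaluation; the error shrinks like gamma^iters."""
    v = [0.0 for _ in STATES]
    for _ in range(iters):
        q = q_from_v(v)
        v = [sum(pi[s][a] * q[s][a] for a in ACTIONS) for s in STATES]
    return v


def visitation(pi, mu, horizon=2000):
    """d(s) = (1 - gamma) * sum_t gamma^t * P_t(s), truncated at `horizon`."""
    p_t, d, weight = list(mu), [0.0 for _ in STATES], 1.0 - GAMMA
    for _ in range(horizon):
        d = [d[s] + weight * p_t[s] for s in STATES]
        # Push the state distribution one step forward under pi.
        p_t = [sum(p_t[s] * pi[s][a] * P[s][a][t] for s in STATES for a in ACTIONS)
               for t in STATES]
        weight *= GAMMA
    return d


v_old, v_new = evaluate(PI), evaluate(PI_NEW)
q_old = q_from_v(v_old)
adv = [[q_old[s][a] - v_old[s] for a in ACTIONS] for s in STATES]  # A^pi(s, a)
d_new = visitation(PI_NEW, MU)

lhs = sum(MU[s] * (v_new[s] - v_old[s]) for s in STATES)  # V(tilde-pi) - V(pi)
rhs = sum(d_new[s] * PI_NEW[s][a] * adv[s][a]
          for s in STATES for a in ACTIONS) / (1.0 - GAMMA)
print(lhs, rhs)
```

Note that the advantage is taken with respect to \(\pi\), while both the state distribution and the action distribution come from \(\tilde{\pi}\); swapping either one breaks the equality.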
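Proof. We first split \(V^{\tilde{\pi}}(s)-V^{\pi}(s)\) by adding and subtracting \(\mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[r(s,a)+\gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[V^{\pi}(s')]]\), and handle the two resulting terms separately: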
\[\begin{aligned}
&V^{\tilde{\pi}}(s) - \mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[r(s,a)+\gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[V^{\pi}(s')] ]\\
&= \mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[\gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[V^{\tilde{\pi}}(s')-V^{\pi}(s')] ]\\
&= \gamma\mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}\mathbb{E}_{s'\sim P(\cdot|s,a)}[V^{\tilde{\pi}}(s')-V^{\pi}(s')]
\end{aligned}
\]
\[\begin{aligned}
&\mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[r(s,a)+\gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[V^{\pi}(s')] ] - V^{{\pi}}(s)\\
&=\mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[Q^{\pi}(s,a)] - V^{{\pi}}(s)\\
&= \mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[Q^{\pi}(s,a)- V^{{\pi}}(s)]\\
&= \mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[A^{\pi}(s,a)]
\end{aligned}
\]
\[\begin{aligned}
&V^{\tilde{\pi}}(s) - V^{{\pi}}(s) \\
&= V^{\tilde{\pi}}(s) - \mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[r(s,a)+\gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[V^{\pi}(s')] ]\\
&\quad + \mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[r(s,a)+\gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[V^{\pi}(s')] ] -V^{{\pi}}(s)\\
&= \gamma\mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}\mathbb{E}_{s'\sim P(\cdot|s,a)}[V^{\tilde{\pi}}(s')-V^{\pi}(s')] + \mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[A^{\pi}(s,a)]
\end{aligned}
\]
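Unrolling this recursion \(n\) times, where \(\mathbb{P}^{\tilde{\pi}}_t(\cdot|s_0)\) denotes the distribution of \(s_t\) when starting from \(s_0\) and following \(\tilde{\pi}\), we have: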
\[\begin{aligned}
&V^{\tilde{\pi}}(s_0) - V^{{\pi}}(s_0) \\
&=\sum_{t=0}^{n}\gamma^t \mathbb{E}_{s_t\sim\mathbb{P}^{\tilde{\pi}}_t(\cdot|s_0)}\mathbb{E}_{a_t\sim\tilde{\pi}(\cdot|s_t)}[A^{\pi}(s_t,a_t)]\\
&\quad+\gamma^{n+1} \mathbb{E}_{s_{n+1}\sim\mathbb{P}^{\tilde{\pi}}_{n+1}(\cdot|s_0)}[V^{\tilde{\pi}}(s_{n+1})-V^{\pi}(s_{n+1})]
\end{aligned}
\]
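The finite-\(n\) expansion above holds exactly for every \(n\), which is easy to confirm on a toy problem. The following sketch again assumes a hypothetical 2-state, 2-action MDP in plain Python (all numbers, the start state `s0`, and the depth `n` are illustrative assumptions). It accumulates the advantage terms for \(t=0,\dots,n\) and adds the remainder term at step \(n+1\).

```python
# Hypothetical 2-state, 2-action MDP; every number here is an illustrative assumption.
GAMMA = 0.9
P = [[[0.8, 0.2], [0.1, 0.9]], [[0.5, 0.5], [0.3, 0.7]]]  # P[s][a][s']
R = [[1.0, 0.0], [0.5, 2.0]]       # R[s][a] = r(s, a)
PI = [[0.6, 0.4], [0.2, 0.8]]      # pi(a | s)
PI_NEW = [[0.1, 0.9], [0.7, 0.3]]  # tilde-pi(a | s)
STATES, ACTIONS = range(2), range(2)


def q_from_v(v):
    """Q(s, a) = r(s, a) + gamma * E_{s'}[V(s')]."""
    return [[R[s][a] + GAMMA * sum(P[s][a][t] * v[t] for t in STATES)
             for a in ACTIONS] for s in STATES]


def evaluate(pi, iters=2000):
    """Iterative policy evaluation; the error shrinks like gamma^iters."""
    v = [0.0 for _ in STATES]
    for _ in range(iters):
        q = q_from_v(v)
        v = [sum(pi[s][a] * q[s][a] for a in ACTIONS) for s in STATES]
    return v


def step(p_t, pi):
    """Map the distribution of s_t to the distribution of s_{t+1} under pi."""
    return [sum(p_t[s] * pi[s][a] * P[s][a][t] for s in STATES for a in ACTIONS)
            for t in STATES]


v_old, v_new = evaluate(PI), evaluate(PI_NEW)
q_old = q_from_v(v_old)
# g(s) = E_{a ~ tilde-pi(.|s)}[A^pi(s, a)]
g = [sum(PI_NEW[s][a] * (q_old[s][a] - v_old[s]) for a in ACTIONS) for s in STATES]

s0, n = 0, 5
p_t = [1.0 if s == s0 else 0.0 for s in STATES]  # P_0(. | s0) is a point mass
partial = 0.0
for t in range(n + 1):
    partial += GAMMA ** t * sum(p_t[s] * g[s] for s in STATES)
    p_t = step(p_t, PI_NEW)

# After the loop, p_t holds P_{n+1}(. | s0).
remainder = GAMMA ** (n + 1) * sum(p_t[s] * (v_new[s] - v_old[s]) for s in STATES)
gap = v_new[s0] - v_old[s0]
print(partial + remainder, gap)
```

Increasing `n` shrinks `remainder` geometrically, which is the behavior the limit argument relies on.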
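When the rewards are bounded, say \(|r(s,a)|\le R_{\max}\), the definitions give \(|V^{\pi}|, |V^{\tilde{\pi}}|\le R_{\max}/(1-\gamma)\). The remainder term is therefore bounded in absolute value by \(2\gamma^{n+1}R_{\max}/(1-\gamma)\), which vanishes as \(n\to\infty\). Taking the limit and assuming a finite (or countable) state space \(\mathcal{S}\), we have: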
\[\begin{aligned}
&V^{\tilde{\pi}}(s_0) - V^{{\pi}}(s_0)\\
&=\sum_{t=0}^{+\infty}\gamma^t \mathbb{E}_{s_t\sim\mathbb{P}^{\tilde{\pi}}_t(\cdot|s_0)}\mathbb{E}_{a_t\sim\tilde{\pi}(\cdot|s_t)}[A^{\pi}(s_t,a_t)]\\
&=\sum_{t=0}^{+\infty}\gamma^t\sum_{s\in\mathcal{S}}\mathbb{E}_{a_t\sim\tilde{\pi}(\cdot|s)}[A^{\pi}(s,a_t)]\mathbb{P}^{\tilde{\pi}}_t(s|s_0)\\
&=\sum_{s\in\mathcal{S}} \mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[A^{\pi}(s,a)]\sum_{t=0}^{+\infty}\gamma^t \mathbb{P}^{\tilde{\pi}}_t(s|s_0)
\end{aligned}
\]
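The weights \(\sum_{t=0}^{+\infty}\gamma^t \mathbb{P}^{\tilde{\pi}}_t(s|s_0)\) do not sum to one over \(s\). Since each \(\mathbb{P}^{\tilde{\pi}}_t(\cdot|s_0)\) is a probability distribution, their total mass is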
\[\begin{aligned}
&\sum_{s\in\mathcal{S}}\sum_{t=0}^{+\infty}\gamma^t \mathbb{P}^{\tilde{\pi}}_t(s|s_0)\\
&= \sum_{t=0}^{+\infty}\gamma^t \sum_{s\in\mathcal{S}}\mathbb{P}^{\tilde{\pi}}_t(s|s_0)\\
&=\frac{1}{1-\gamma}
\end{aligned}
\]
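This normalization constant can also be observed numerically. The sketch below is plain Python with a hypothetical 2-state, 2-action transition kernel `P` and policy `PI_NEW` standing in for \(\tilde{\pi}\) (all values, including the start state `s0`, are illustrative assumptions). It accumulates the discounted state occupancy from a fixed start state, with the infinite sum truncated at a long horizon.

```python
# Hypothetical 2-state, 2-action MDP; every number here is an illustrative assumption.
GAMMA = 0.9
P = [[[0.8, 0.2], [0.1, 0.9]], [[0.5, 0.5], [0.3, 0.7]]]  # P[s][a][s']
PI_NEW = [[0.1, 0.9], [0.7, 0.3]]  # tilde-pi(a | s)
STATES, ACTIONS = range(2), range(2)

s0 = 1
p_t = [1.0 if s == s0 else 0.0 for s in STATES]  # P_0(. | s0) is a point mass
mass = [0.0 for _ in STATES]  # mass[s] = sum_t gamma^t * P_t(s | s0)
weight = 1.0                  # current value of gamma^t

for _ in range(2000):         # truncation; the tail is of order gamma^2000
    mass = [mass[s] + weight * p_t[s] for s in STATES]
    # Push the state distribution one step forward under tilde-pi.
    p_t = [sum(p_t[s] * PI_NEW[s][a] * P[s][a][t] for s in STATES for a in ACTIONS)
           for t in STATES]
    weight *= GAMMA

total = sum(mass)                          # should be close to 1 / (1 - gamma)
d = [(1.0 - GAMMA) * m for m in mass]      # rescaled weights form a distribution
print(total, d)
```

Multiplying by \(1-\gamma\) is exactly what turns these unnormalized weights into a probability distribution over states.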
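Multiplying and dividing by \(1-\gamma\) therefore turns these weights into a probability distribution over \(s\):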
\[\begin{aligned}
&V^{\tilde{\pi}}(s_0) - V^{{\pi}}(s_0)\\
&= \sum_{s\in\mathcal{S}} \frac{1}{1-\gamma}\mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[A^{\pi}(s,a)] (1-\gamma)\sum_{t=0}^{+\infty}\gamma^t \mathbb{P}^{\tilde{\pi}}_t(s|s_0)
\end{aligned}
\]
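Finally, taking the expectation over \(s_0\sim\mu\) and recognizing the definition of \(d^{\tilde{\pi}}_{\mu}\) gives: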
\[\begin{aligned}
&V(\tilde{\pi}) - V({\pi})\\
&=\sum_{s_0\in\mathcal{S}}\mu(s_0)\sum_{s\in\mathcal{S}} \frac{1}{1-\gamma}\mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[A^{\pi}(s,a)] (1-\gamma)\sum_{t=0}^{+\infty}\gamma^t \mathbb{P}^{\tilde{\pi}}_t(s|s_0)\\
&=\sum_{s\in\mathcal{S}} \frac{1}{1-\gamma}\mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[A^{\pi}(s,a)]\sum_{s_0\in\mathcal{S}}\mu(s_0)(1-\gamma)\sum_{t=0}^{+\infty}\gamma^t \mathbb{P}^{\tilde{\pi}}_t(s|s_0)\\
&=\frac{1}{1-\gamma}\sum_{s\in\mathcal{S}}\mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[A^{\pi}(s,a)]\mathbb{E}_{s_0\sim\mu}[(1-\gamma)\sum_{t=0}^{+\infty}\gamma^t \mathbb{P}^{\tilde{\pi}}_t(s|s_0)]\\
&=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mu}^{\tilde{\pi}}}\mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[A^{\pi}(s,a)]
\end{aligned}
\]
