Convergence Analysis Tools for Reinforcement Learning (1)

Introduction

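This series collects and studies tools that are commonly used in convergence analysis for reinforcement learning. Organizing these tools helps sharpen one's sensitivity to problem structure and build the intuition that goes with it.

This first part mainly follows this article and fills in some of the proof steps it omits.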

Notation


\[\begin{align} V^{\pi}(s) & := \mathbb{E}[\sum_{t=0}^{\infty}\gamma^t r_t | s_0=s,\pi] \tag{1}\\ Q^{\pi}(s,a) & := \mathbb{E}[\sum_{t=0}^{\infty}\gamma^t r_t | s_0=s,a_0=a,\pi] \tag{2}\\ A^{\pi}(s,a) & := Q^{\pi}(s,a)-V^{\pi}(s) \tag{3} \end{align} \]

To simplify the notation, we define:

\[\begin{align} V^{\pi}({\mu}) & := \mathbb{E}_{s\sim\mu}[V^{\pi}(s)]\tag{4}\\ V(\pi) & := V^{\pi}({\mu}) \tag{5}\\ V(\theta) & := V(\pi_{\theta}) \tag{6} \end{align} \]

The goal of reinforcement learning can therefore be stated as:

\[\max_{\pi}V^{\pi}(\mu) \]

The relation between \(V\) and \(Q\) is also easy to see:

\[V^{\pi}(s) = \mathbb{E}_{a\sim\pi(\cdot|s)}[Q^{\pi}(s,a)] \]

From these definitions we obtain the basic recursions:

\[\begin{aligned} V^{\pi}(s) &= \mathbb{E}_{a\sim\pi(\cdot|s)}[r(s,a)+\mathbb{E}_{s'\sim P(\cdot|s,a)}\mathbb{E}[\sum_{t=1}^{\infty}\gamma^t r_t | s_1=s',\pi]] \\ & = \mathbb{E}_{a\sim\pi(\cdot|s)}[r(s,a)+\gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[V^{\pi}(s')] ] \\ Q^{\pi}(s,a) &= r(s,a)+\gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[V^{\pi}(s')] \end{aligned} \]
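As a quick numerical sanity check of these recursions, the following sketch builds a small random tabular MDP and solves the linear system \(V^{\pi} = r^{\pi} + \gamma P^{\pi}V^{\pi}\) directly. The arrays `P`, `r`, `pi` and the helper `evaluate` are illustrative assumptions introduced here, not part of the referenced article; a finite state and action space is assumed throughout.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# Hypothetical random tabular MDP: P[s, a, s'] are transition
# probabilities and r[s, a] are rewards (both made up for illustration).
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))

# A random stochastic policy pi[s, a].
pi = rng.random((nS, nA))
pi /= pi.sum(axis=1, keepdims=True)

def evaluate(policy):
    """Solve V = r_pi + gamma * P_pi V exactly, then recover Q from V."""
    P_pi = np.einsum("sa,sat->st", policy, P)  # state-to-state kernel under the policy
    r_pi = (policy * r).sum(axis=1)            # expected one-step reward under the policy
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = r + gamma * P @ V                      # Q(s,a) = r(s,a) + gamma * E_{s'}[V(s')]
    return V, Q

V, Q = evaluate(pi)
A = Q - V[:, None]                             # advantage, as in definition (3)

# V(s) should equal the policy-weighted average of Q(s, .).
print(np.allclose(V, (pi * Q).sum(axis=1)))
```

A direct consequence, also visible numerically, is that the advantage averages to zero under the policy that defines it: \(\mathbb{E}_{a\sim\pi(\cdot|s)}[A^{\pi}(s,a)] = 0\).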

Performance Difference Lemma

The Performance Difference Lemma states:

\[\begin{aligned} V(\tilde{\pi}) - V(\pi)&=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\tilde{\pi}}_{\mu}}([\mathcal{T}^{\tilde{\pi}}V^{\pi}-V^{\pi}](s))\\ &= \frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\tilde{\pi}}_{\mu}}\mathbb{E}_{a\sim \tilde{\pi}(\cdot |s)}[A^{\pi}(s,a)] \end{aligned} \]

where \(\displaystyle d_{\mu}^{\tilde{\pi}}(s)=\mathbb{E}_{s_0\sim\mu}[(1-\gamma)\sum_{t=0}^{+\infty}\gamma^t \mathbb{P}^{\tilde{\pi}}_t(s|s_0)]\) and

\[\begin{aligned} (\mathcal{T}^{\tilde{\pi}}V^{\pi})(s) & :=\mathbb{E}_{a\sim \tilde{\pi}(\cdot |s),s'\sim P(\cdot |s,a)}[r(s,a)+\gamma V^{\pi}(s')]\\ &=\mathbb{E}_{a\sim \tilde{\pi}(\cdot |s)}[Q^{\pi}(s,a)] \end{aligned} \]

The Performance Difference Lemma quantifies the gap in the value function \(V\) between two policies \(\pi\) and \(\tilde{\pi}\).
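The lemma can be checked numerically on a small tabular MDP. The sketch below is only an illustration under stated assumptions: the MDP (`P`, `r`, `mu`), the two policies (`pi`, `pi_new`) and the helpers `normalize`, `kernel`, `value` are all made up here. It also uses the closed form \(d_{\mu}^{\tilde{\pi}} = (1-\gamma)\,\mu^{\top}(I-\gamma P^{\tilde{\pi}})^{-1}\), which follows from summing the geometric series in the definition of \(d_{\mu}^{\tilde{\pi}}\) for a finite state space.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 5, 3, 0.9

def normalize(x):
    """Scale the last axis so that it sums to one."""
    return x / x.sum(axis=-1, keepdims=True)

# Hypothetical random MDP, initial distribution, and two policies.
P = normalize(rng.random((nS, nA, nS)))   # P[s, a, s']
r = rng.random((nS, nA))                  # r[s, a]
mu = normalize(rng.random(nS))            # initial state distribution
pi = normalize(rng.random((nS, nA)))      # plays the role of pi
pi_new = normalize(rng.random((nS, nA)))  # plays the role of tilde-pi

def kernel(policy):
    """State-to-state transition matrix induced by a policy."""
    return np.einsum("sa,sat->st", policy, P)

def value(policy):
    """Exact V^policy from the linear Bellman equation."""
    r_pol = (policy * r).sum(axis=1)
    return np.linalg.solve(np.eye(nS) - gamma * kernel(policy), r_pol)

V_old, V_new = value(pi), value(pi_new)
A_old = r + gamma * P @ V_old - V_old[:, None]   # A^pi(s, a)

# Discounted state visitation of tilde-pi started from mu (closed form).
d = (1 - gamma) * np.linalg.solve((np.eye(nS) - gamma * kernel(pi_new)).T, mu)

lhs = mu @ (V_new - V_old)                                   # V(tilde-pi) - V(pi)
rhs = (d * (pi_new * A_old).sum(axis=1)).sum() / (1 - gamma)
print(np.isclose(lhs, rhs))
```

Note that the states are weighted by the visitation distribution of \(\tilde{\pi}\), while the advantage is that of \(\pi\); swapping either one breaks the equality.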

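For a detailed proof, see these lecture notes. The main idea is: split the difference into two terms \(\to\) unroll to obtain a recursion \(\to\) solve for the general term \(\to\) construct the new probability distribution \(d^{\tilde{\pi}}_{\mu}\). The lecture notes explain every step clearly except the last one.

Proof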

\[\begin{aligned} &V^{\tilde{\pi}}(s) - \mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[r(s,a)+\gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[V^{\pi}(s')] ]\\ &= \mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[\gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[V^{\tilde{\pi}}(s')-V^{\pi}(s')] ]\\ &= \gamma\mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}\mathbb{E}_{s'\sim P(\cdot|s,a)}[V^{\tilde{\pi}}(s')-V^{\pi}(s')] \end{aligned} \]

\[\begin{aligned} &\mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[r(s,a)+\gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[V^{\pi}(s')] ] - V^{{\pi}}(s)\\ &=\mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[Q^{\pi}(s,a)] - V^{{\pi}}(s)\\ &= \mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[Q^{\pi}(s,a)- V^{{\pi}}(s)]\\ &= \mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[A^{\pi}(s,a)] \end{aligned} \]

\[\begin{aligned} &V^{\tilde{\pi}}(s) - V^{{\pi}}(s) \\ &= V^{\tilde{\pi}}(s) - \mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[r(s,a)+\gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[V^{\pi}(s')] ]\\ &\quad + \mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[r(s,a)+\gamma\mathbb{E}_{s'\sim P(\cdot|s,a)}[V^{\pi}(s')] ] -V^{{\pi}}(s)\\ &= \gamma\mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}\mathbb{E}_{s'\sim P(\cdot|s,a)}[V^{\tilde{\pi}}(s')-V^{\pi}(s')] + \mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[A^{\pi}(s,a)] \end{aligned} \]

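Unrolling this recursion \(n\) times, and writing \(\mathbb{P}^{\tilde{\pi}}_t(\cdot|s_0)\) for the distribution of \(s_t\) when starting from \(s_0\) and following \(\tilde{\pi}\), we get: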

\[\begin{aligned} &V^{\tilde{\pi}}(s_0) - V^{{\pi}}(s_0) \\ &=\sum_{t=0}^{n}\gamma^t \mathbb{E}_{s_t\sim\mathbb{P}^{\tilde{\pi}}_t(\cdot|s_0)}\mathbb{E}_{a_t\sim\tilde{\pi}(\cdot|s_t)}[A^{\pi}(s_t,a_t)]\\ &\quad+\gamma^{n+1} \mathbb{E}_{s_{n+1}\sim\mathbb{P}^{\tilde{\pi}}_{n+1}(\cdot|s_0)}[V^{\tilde{\pi}}(s_{n+1})-V^{\pi}(s_{n+1})] \end{aligned} \]
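The finite-\(n\) identity can be verified directly for a tabular MDP, where \(\mathbb{P}^{\tilde{\pi}}_t(\cdot|s_0)\) is row \(s_0\) of the \(t\)-th power of the transition matrix induced by \(\tilde{\pi}\). The following sketch assumes a made-up random MDP; the names `P`, `r`, `pi`, `pi_new`, `kernel`, `value` and the choice `n = 3` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma, n = 4, 2, 0.9, 3

def normalize(x):
    """Scale the last axis so that it sums to one."""
    return x / x.sum(axis=-1, keepdims=True)

# Hypothetical random MDP and two policies (pi_new plays tilde-pi).
P = normalize(rng.random((nS, nA, nS)))
r = rng.random((nS, nA))
pi = normalize(rng.random((nS, nA)))
pi_new = normalize(rng.random((nS, nA)))

def kernel(policy):
    """State-to-state transition matrix induced by a policy."""
    return np.einsum("sa,sat->st", policy, P)

def value(policy):
    """Exact V^policy from the linear Bellman equation."""
    r_pol = (policy * r).sum(axis=1)
    return np.linalg.solve(np.eye(nS) - gamma * kernel(policy), r_pol)

V_old, V_new = value(pi), value(pi_new)
diff = V_new - V_old                       # indexed by the start state s_0

# g(s) = E_{a ~ tilde-pi}[A^pi(s, a)] = E_{a ~ tilde-pi}[Q^pi(s, a)] - V^pi(s)
g = (pi_new * (r + gamma * P @ V_old)).sum(axis=1) - V_old

K = kernel(pi_new)                         # K^t holds P_t(s | s_0) in row s_0
partial = sum(gamma**t * np.linalg.matrix_power(K, t) @ g for t in range(n + 1))
remainder = gamma**(n + 1) * np.linalg.matrix_power(K, n + 1) @ diff

print(np.allclose(diff, partial + remainder))
```

Because every row of a power of `K` is a probability distribution, the remainder is bounded in absolute value by \(\gamma^{n+1}\max_s|V^{\tilde{\pi}}(s)-V^{\pi}(s)|\), which is what makes it vanish in the limit.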

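When the rewards are bounded, \(V^{\pi}\) and \(V^{\tilde{\pi}}\) are bounded by definition, so the remainder term, which carries the factor \(\gamma^{n+1}\), vanishes as \(n\to\infty\) (for \(\gamma<1\)). Hence: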

\[\begin{aligned} &V^{\tilde{\pi}}(s_0) - V^{{\pi}}(s_0)\\ &=\sum_{t=0}^{+\infty}\gamma^t \mathbb{E}_{s_t\sim\mathbb{P}^{\tilde{\pi}}_t(\cdot|s_0)}\mathbb{E}_{a_t\sim\tilde{\pi}(\cdot|s_t)}[A^{\pi}(s_t,a_t)]\\ &=\sum_{t=0}^{+\infty}\gamma^t\sum_{s\in\mathcal{S}}\mathbb{E}_{a_t\sim\tilde{\pi}(\cdot|s)}[A^{\pi}(s,a_t)]\mathbb{P}^{\tilde{\pi}}_t(s|s_0)\\ &=\sum_{s\in\mathcal{S}} \mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[A^{\pi}(s,a)]\sum_{t=0}^{+\infty}\gamma^t \mathbb{P}^{\tilde{\pi}}_t(s|s_0)\\ \end{aligned} \]

Moreover, since

\[\begin{aligned} &\sum_{s\in\mathcal{S}}\sum_{t=0}^{+\infty}\gamma^t \mathbb{P}^{\tilde{\pi}}_t(s|s_0)\\ &= \sum_{t=0}^{+\infty}\gamma^t \sum_{s\in\mathcal{S}}\mathbb{P}^{\tilde{\pi}}_t(s|s_0)\\ &=\frac{1}{1-\gamma} \end{aligned} \]
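This normalization is what turns \((1-\gamma)\sum_{t}\gamma^t\mathbb{P}^{\tilde{\pi}}_t(\cdot|s_0)\) into a probability distribution over states. The sketch below checks it with a truncated series, assuming a finite state space; the matrix `P_pol` (standing in for the state-to-state kernel under \(\tilde{\pi}\)) and the truncation length `T` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
nS, gamma, T = 4, 0.9, 500

# Hypothetical state-to-state kernel under tilde-pi: each row sums to one.
P_pol = rng.random((nS, nS))
P_pol /= P_pol.sum(axis=1, keepdims=True)

# occupancy[s0, s] approximates sum_{t < T} gamma^t * P_t(s | s0).
occupancy = np.zeros((nS, nS))
P_t = np.eye(nS)                 # P_0(. | s0) is a point mass at s0
for t in range(T):
    occupancy += gamma**t * P_t
    P_t = P_t @ P_pol            # advance the t-step distribution by one step

row_mass = occupancy.sum(axis=1)                          # total mass per start state
closed_form = np.linalg.inv(np.eye(nS) - gamma * P_pol)   # limit of the series

print(np.allclose(row_mass, 1 / (1 - gamma)))
print(np.allclose(occupancy, closed_form))
```

Multiplying any row of `occupancy` by \(1-\gamma\) therefore gives a distribution over states, up to the truncation error of order \(\gamma^{T}\).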

it follows that

\[\begin{aligned} &V^{\tilde{\pi}}(s_0) - V^{{\pi}}(s_0)\\ &= \sum_{s\in\mathcal{S}} \frac{1}{1-\gamma}\mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[A^{\pi}(s,a)] (1-\gamma)\sum_{t=0}^{+\infty}\gamma^t \mathbb{P}^{\tilde{\pi}}_t(s|s_0)\\ \end{aligned} \]

Taking the expectation over \(s_0\sim\mu\) on both sides then gives:

\[\begin{aligned} &V(\tilde{\pi}) - V({\pi})\\ &=\sum_{s_0\in\mathcal{S}}\mu(s_0)\sum_{s\in\mathcal{S}} \frac{1}{1-\gamma}\mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[A^{\pi}(s,a)] (1-\gamma)\sum_{t=0}^{+\infty}\gamma^t \mathbb{P}^{\tilde{\pi}}_t(s|s_0)\\ &=\sum_{s\in\mathcal{S}} \frac{1}{1-\gamma}\mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[A^{\pi}(s,a)]\sum_{s_0\in\mathcal{S}}\mu(s_0)(1-\gamma)\sum_{t=0}^{+\infty}\gamma^t \mathbb{P}^{\tilde{\pi}}_t(s|s_0)\\ &=\frac{1}{1-\gamma}\sum_{s\in\mathcal{S}}\mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[A^{\pi}(s,a)]\mathbb{E}_{s_0\sim\mu}[(1-\gamma)\sum_{t=0}^{+\infty}\gamma^t \mathbb{P}^{\tilde{\pi}}_t(s|s_0)]\\ &=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mu}^{\tilde{\pi}}}\mathbb{E}_{a\sim\tilde{\pi}(\cdot|s)}[A^{\pi}(s,a)] \end{aligned} \]

posted @ 2025-03-04 16:17  p0q