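Result
If two continuous-time Markov chains (CTMCs) \(X_t\) and \(\hat X_t\) have marginal distributions \(\pi_t\) and \(\hat\pi_t\) generated by the time-dependent transition rate matrices \(R_t\) and \(\hat R_t\) respectively, and they share the same initial distribution \(\pi_0=\hat\pi_0\), then: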
\[\mathbb E_{x \sim \pi_1}\bigg[ \log\frac{\hat\pi_1(x)}{\pi_1(x)} \bigg] \ge \int_0^1 \mathbb E_{x \sim \pi_t} \bigg[ \hat R_t(x, x)-R_t(x, x) + \sum_{y \ne x}R_t(x, y)\log\frac{\hat R_t(x, y)}{R_t(x, y)} \bigg] \mathrm{d}t.
\]
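As a sanity check, the bound can be evaluated numerically on a toy chain. The sketch below is only illustrative and rests on several assumptions that are not part of the statement: two states, time-homogeneous rates (the result allows time-dependent ones), forward-Euler integration of \(\dot\pi_t=\pi_t R_t\), and a left Riemann sum for the time integral. The function names `euler_step`, `integrand` and `check_bound` are hypothetical helpers.

```python
import math

def euler_step(pi, R, dt):
    """One forward-Euler step of the master equation d(pi)/dt = pi R."""
    n = len(pi)
    return [pi[x] + dt * sum(pi[y] * R[y][x] for y in range(n)) for x in range(n)]

def integrand(pi, R, R_hat):
    """E_{x~pi}[ R_hat(x,x) - R(x,x) + sum_{y!=x} R(x,y) log(R_hat(x,y)/R(x,y)) ]."""
    n = len(pi)
    total = 0.0
    for x in range(n):
        inner = R_hat[x][x] - R[x][x]
        for y in range(n):
            if y != x:
                inner += R[x][y] * math.log(R_hat[x][y] / R[x][y])
        total += pi[x] * inner
    return total

def check_bound(R, R_hat, pi0, steps=2000):
    """Return (lhs, rhs) of the bound on [0, 1], both approximated numerically."""
    dt = 1.0 / steps
    pi, pi_hat = list(pi0), list(pi0)      # same initial distribution
    rhs = 0.0
    for _ in range(steps):
        rhs += integrand(pi, R, R_hat) * dt  # left Riemann sum
        pi = euler_step(pi, R, dt)
        pi_hat = euler_step(pi_hat, R_hat, dt)
    lhs = sum(p * math.log(q / p) for p, q in zip(pi, pi_hat))
    return lhs, rhs

# Assumed toy rates: the second chain jumps three times as fast as the first.
R = [[-1.0, 1.0], [1.0, -1.0]]
R_hat = [[-3.0, 3.0], [3.0, -3.0]]
lhs, rhs = check_bound(R, R_hat, [0.9, 0.1])
print(lhs, rhs, lhs >= rhs)
```

For these particular rates the integrand does not depend on \(\pi_t\), so the right-hand side is simply \(\log 3-2\), while the left-hand side is minus a small KL divergence.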
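The appendix of MultiFlow gives a proof, but it has to move the CTMC to the space of jump times and choose a reference measure there in order to define densities, which is fairly involved. Is there a simpler way?
Proof
We only need the following basic properties of a CTMC: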
- \(\forall x \ne y,\ R_t(x, y) \ge 0;\)
- \(\forall x,\ R_t(x, x)=-\sum_{y \ne x}R_t(x, y);\)
- \(\forall x,\ \dot\pi_t(x)=\sum_y \pi_t(y)R_t(y, x).\)
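It suffices to prove the following differential inequality for every \(t\): integrating it over \([0, 1]\) and using \(\pi_0=\hat\pi_0\) (so that the expectation on the left vanishes at \(t=0\)) gives the result.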
\[\frac{\mathrm d}{\mathrm dt}\mathbb E_{x \sim \pi_t}\bigg[ \log\frac{\hat\pi_t(x)}{\pi_t(x)} \bigg] \ge \mathbb E_{x \sim \pi_t} \bigg[ \hat R_t(x, x)-R_t(x, x) + \sum_{y \ne x}R_t(x, y)\log\frac{\hat R_t(x, y)}{R_t(x, y)} \bigg].
\]
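Compute the left-hand side, using \(\frac{\mathrm d}{\mathrm dt}\hat\pi_t(x)=\sum_y\hat\pi_t(y)\hat R_t(y, x)\) in the second step: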
\[\begin{align*}
\frac{\mathrm d}{\mathrm dt}\sum_x \pi_t(x)\log\frac{\hat\pi_t(x)}{\pi_t(x)}
& = \sum_x \Bigg( \dot\pi_t(x)\log\frac{\hat\pi_t(x)}{\pi_t(x)}+\pi_t(x)\frac{\frac{\mathrm d}{\mathrm dt}\hat\pi_t(x)}{\hat\pi_t(x)}-\dot\pi_t(x) \Bigg) \\
& = \sum_x \Bigg( \dot\pi_t(x)\bigg(\log\frac{\hat\pi_t(x)}{\pi_t(x)}-1\bigg)+\pi_t(x)\sum_y\frac{\hat\pi_t(y)\hat R_t(y, x)}{\hat\pi_t(x)} \Bigg) \\
& =: L_1 + L_2 + L_3,
\end{align*}
\]
where \(L_2\) and \(L_3\) collect the \(y=x\) and \(y \ne x\) terms of the double sum, respectively:
\[\begin{align*}
L_1 & =\sum_x\dot\pi_t(x)\left( \log\frac{\hat\pi_t(x)}{\pi_t(x)}-1 \right), \\
L_2 & =\sum_x\pi_t(x)\hat R_t(x, x), \\
L_3 & =\sum_{(x, y): x \ne y}\pi_t(x)\dfrac{\hat\pi_t(y)\hat R_t(y, x)}{\hat\pi_t(x)}.
\end{align*}
\]
Bound \(L_3\) from below using \(u \ge 1+\log u\) for \(u>0\):
\[\begin{align*}
L_3 & = \sum_{(x, y): x \ne y}\frac{\hat\pi_t(y)\hat R_t(y, x)}{\hat\pi_t(x)}\pi_t(x) \\
& = \sum_{(x, y): x \ne y}\pi_t(y) R_t(y, x) \frac{\hat\pi_t(y)\hat R_t(y, x)\pi_t(x)}{\pi_t(y) R_t(y, x) \hat\pi_t(x)} \\
& \ge \sum_{(x, y): x \ne y}\pi_t(y) R_t(y, x) \left(1 + \log\frac{\hat\pi_t(y)\hat R_t(y, x)\pi_t(x)}{\pi_t(y) R_t(y, x) \hat\pi_t(x)}\right) \\
& = \sum_{(x, y): x \ne y}\pi_t(y) R_t(y, x) \left(1 + \log\frac{\hat\pi_t(y)}{\pi_t(y)}+\log\frac{\hat R_t(y, x)}{R_t(y, x)}+\log\frac{\pi_t(x)}{\hat\pi_t(x)}\right).
\end{align*}
\]
Note that \(\forall x,\ \sum_{y \ne x}\pi_t(y)R_t(y, x)=\dot\pi_t(x)-\pi_t(x)R_t(x, x)\), so
\[\begin{align*}
\sum_{(x, y): x \ne y}\pi_t(y) R_t(y, x) \Big(1 + \log\frac{\pi_t(x)}{\hat\pi_t(x)}\Big)&=\sum_x \Big(\dot\pi_t(x)-\pi_t(x)R_t(x, x)\Big)\Big(1 - \log\frac{\hat\pi_t(x)}{\pi_t(x)}\Big)\\
&=\sum_x \pi_t(x)R_t(x, x)\Big( \log\frac{\hat\pi_t(x)}{\pi_t(x)}-1 \Big)-L_1
\end{align*}
\]
Next, rename the summation variables by swapping \(x\) and \(y\):
\[\sum_{(x, y): x \ne y}\pi_t(y) R_t(y, x)\log\frac{\hat R_t(y, x)}{R_t(y, x)}
=\sum_{(x, y): x \ne y}\pi_t(x) R_t(x, y)\log\frac{\hat R_t(x, y)}{R_t(x, y)}
\]
The same renaming, together with \(\forall x,\ R_t(x, x)=-\sum_{y \ne x}R_t(x, y)\), gives
\[\begin{align*}
\sum_{(x, y): x \ne y}\pi_t(y) R_t(y, x)\log\frac{\hat\pi_t(y)}{\pi_t(y)}
& =\sum_{(x, y): x \ne y}\pi_t(x) R_t(x, y)\log\frac{\hat\pi_t(x)}{\pi_t(x)} \\
& =-\sum_{x}\pi_t(x) R_t(x, x)\log\frac{\hat\pi_t(x)}{\pi_t(x)}
\end{align*}
\]
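Substituting these three identities into the bound on \(L_3\), the terms containing \(\log\frac{\hat\pi_t(x)}{\pi_t(x)}\) cancel and \(L_1\) cancels against \(-L_1\). Therefore,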
\[\begin{align*}
L_1+L_2+L_3 & \ge\sum_x \pi_t(x)\Bigg[ \hat R_t(x, x)-R_t(x, x) + \sum_{y \ne x}R_t(x, y)\log\frac{\hat R_t(x, y)}{R_t(x, y)} \Bigg]
\end{align*}
\]
This completes the proof.
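The proof only uses the three CTMC properties at a single time \(t\), so the differential inequality should hold for arbitrary strictly positive \(\pi_t, \hat\pi_t\) and arbitrary rate matrices with positive off-diagonal entries. The following sketch spot-checks this on random instances. The helpers `random_rates`, `random_dist` and `lhs_rhs` are hypothetical names, and strict positivity is an assumption made here so that every logarithm is finite.

```python
import math
import random

def random_rates(n, rng):
    """Random rate matrix: positive off-diagonal entries, rows summing to zero."""
    R = [[rng.uniform(0.1, 2.0) for _ in range(n)] for _ in range(n)]
    for x in range(n):
        R[x][x] = -sum(R[x][y] for y in range(n) if y != x)
    return R

def random_dist(n, rng):
    """Random strictly positive probability vector."""
    w = [rng.uniform(0.1, 1.0) for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]

def lhs_rhs(pi, pi_hat, R, R_hat):
    """Both sides of the differential inequality at one instant."""
    n = len(pi)
    dpi = [sum(pi[y] * R[y][x] for y in range(n)) for x in range(n)]
    dpi_hat = [sum(pi_hat[y] * R_hat[y][x] for y in range(n)) for x in range(n)]
    # d/dt sum_x pi log(pi_hat/pi), expanded as in the proof
    lhs = sum(
        dpi[x] * (math.log(pi_hat[x] / pi[x]) - 1.0) + pi[x] * dpi_hat[x] / pi_hat[x]
        for x in range(n)
    )
    rhs = sum(
        pi[x] * (
            R_hat[x][x] - R[x][x]
            + sum(R[x][y] * math.log(R_hat[x][y] / R[x][y]) for y in range(n) if y != x)
        )
        for x in range(n)
    )
    return lhs, rhs

rng = random.Random(0)
gaps = []
for _ in range(1000):
    n = rng.randint(2, 5)
    lhs, rhs = lhs_rhs(random_dist(n, rng), random_dist(n, rng),
                       random_rates(n, rng), random_rates(n, rng))
    gaps.append(lhs - rhs)
print(min(gaps))  # should be nonnegative up to floating-point rounding
```

When \(\pi_t=\hat\pi_t\) and \(R_t=\hat R_t\), both sides are zero, which is the equality case of \(u \ge 1+\log u\).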
Connection to the ELBO
Note that the only inequality used in the proof is, for each \(x\),
\[\begin{equation}
\sum_{y \ne x}\frac{\hat\pi_t(y)\hat R_t(y, x)\pi_t(x)}{\hat\pi_t(x)} \ge \sum_{y \ne x}\pi_t(y) R_t(y, x) \left(1 + \log\frac{\hat\pi_t(y)\hat R_t(y, x)\pi_t(x)}{\pi_t(y) R_t(y, x) \hat\pi_t(x)}\right)
\end{equation}
\]
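We now show that this is exactly the ELBO applied over an infinitesimal time step.
1. The ELBO says that a KL divergence is nonnegative
Start from the familiar form of the ELBO, which follows from Jensen's inequality: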
\[\log p_X(x) = \log\sum_z p(x, z) = \log\mathbb E_{z \sim q(\cdot \mid x)} \bigg[ \frac{p(x, z)}{q(z \mid x)} \bigg] \ge \mathbb E_{z \sim q(\cdot \mid x)} \bigg[ \log\frac{p(x, z)}{q(z \mid x)} \bigg]
\]
Note that
\[\log\frac{p(x, z)}{q(z \mid x)}-\log p_X(x)=\log\frac{p_{Z \mid X}(z \mid x)}{q(z \mid x)}
\]
so subtracting \(\log p_X(x)\) from both sides of the inequality above gives
\[\begin{equation}
0 \ge \mathbb E_{z \sim q(\cdot \mid x)} \bigg[ \log\frac{p_{Z \mid X}(z \mid x)}{q(z \mid x)} \bigg] =: -D_{\rm KL}\Big( q(\cdot \mid x) \mathop{\Big\Vert} p_{Z \mid X}(\cdot \mid x) \Big).
\end{equation}
\]
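The identity \(\log p_X(x)-\mathrm{ELBO}=D_{\rm KL}\big(q(\cdot \mid x)\,\Vert\,p_{Z \mid X}(\cdot \mid x)\big)\) is easy to confirm on a small discrete example. The numbers below are made up purely for illustration, and `elbo_and_kl` is a hypothetical helper.

```python
import math

def elbo_and_kl(joint_xz, q):
    """joint_xz[z] = p(x, z) for one fixed x; q[z] = q(z | x).

    Returns (log p_X(x), ELBO, KL(q || p_{Z|X}(. | x))).
    """
    px = sum(joint_xz)
    log_px = math.log(px)
    elbo = sum(qz * math.log(pxz / qz) for qz, pxz in zip(q, joint_xz))
    kl = sum(qz * math.log(qz / (pxz / px)) for qz, pxz in zip(q, joint_xz))
    return log_px, elbo, kl

# Assumed toy values: three latent states, one observed x.
log_px, elbo, kl = elbo_and_kl([0.10, 0.05, 0.15], [0.5, 0.2, 0.3])
print(log_px - elbo, kl)  # the two numbers agree up to rounding
```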
2. Transition rates of the reverse process
For \(x \ne y\), let
\[R_t^\gets(x, y)=\frac{\pi_t(y) R_t(y, x)}{\pi_t(x)}, \hat R_t^\gets(x, y)=\frac{\hat\pi_t(y)\hat R_t(y, x)}{\hat \pi_t(x)},
\]
and let \(R_t^\gets(x, x)=-\sum_{y \ne x}R_t^\gets(x, y)\) and \(\hat R_t^\gets(x, x)=-\sum_{y \ne x}\hat R_t^\gets(x, y)\).
Then, after dividing both sides by \(\pi_t(x)>0\), (1) is equivalent to
\[\begin{equation}
-\hat R_t^\gets(x, x) \ge -R_t^\gets(x, x)+\sum_{y \ne x}R_t^\gets(x, y)\log\frac{\hat R_t^\gets(x, y)}{R_t^\gets(x, y)}.
\end{equation}
\]
We now show that \(R_t^\gets\) defined above is the transition rate of the reverse process, i.e. to first order in \(\mathrm dt\),
\[\Pr(X_{t-\mathrm dt}=y \mid X_t=x)=1_{x=y}+R_t^\gets(x, y)\mathop{}\!\mathrm dt.
\]
For \(x \ne y\), Bayes' rule gives, after dropping terms of order \(\mathrm dt^2\),
\[\begin{align*}
\Pr(X_{t-\mathrm dt}=y \mid X_t=x) & =\frac{\pi_{t-\mathrm dt}(y)}{\pi_t(x)}\Pr(X_t=x \mid X_{t-\mathrm dt}=y) \\
& =\frac{\pi_t(y)-\dot\pi_t(y)\,\mathrm dt}{\pi_t(x)}R_t(y, x)\,\mathrm dt \\
& =\frac{\pi_t(y)}{\pi_t(x)}R_t(y, x)\,\mathrm dt \\
& =R_t^\gets(x, y)\,\mathrm dt;
\end{align*}
\]
For \(x=y\),
\[\begin{align*}
\Pr(X_{t-\mathrm dt}=x \mid X_t=x) & =1-\sum_{y' \ne x}\Pr(X_{t-\mathrm dt}=y' \mid X_t=x) \\
& =1-\sum_{y' \ne x}R_t^\gets(x, y')\,\mathrm dt \\
& =1+R_t^\gets(x, x)\,\mathrm dt,
\end{align*}
\]
So \(R_t^\gets\) is indeed the transition rate of the time reversal of \(\{X_t\}\), and likewise \(\hat R_t^\gets\) is the transition rate of the time reversal of \(\{\hat X_t\}\).
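The first-order claim can be checked numerically by comparing the exact Bayes posterior of a one-step kernel with \(R_t^\gets\,\mathrm dt\). This is a sketch under assumptions: the forward kernel over a step \(\mathrm dt\) is approximated by \(I+R\,\mathrm dt\), the rates are time-homogeneous, and the three-state matrix and the names `reverse_rates` and `one_step_posterior` are invented for illustration.

```python
def reverse_rates(pi, R):
    """R_back(x, y) = pi(y) R(y, x) / pi(x) for x != y; rows sum to zero."""
    n = len(pi)
    Rb = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if y != x:
                Rb[x][y] = pi[y] * R[y][x] / pi[x]
        Rb[x][x] = -sum(Rb[x][y] for y in range(n) if y != x)
    return Rb

def one_step_posterior(mu, R, dt):
    """Bayes posterior Q(x, y) = Pr(X_{t-dt}=y | X_t=x), with X_{t-dt} ~ mu
    and the forward kernel approximated by P = I + R dt."""
    n = len(mu)
    P = [[(1.0 if x == y else 0.0) + R[x][y] * dt for y in range(n)] for x in range(n)]
    pi_t = [sum(mu[y] * P[y][x] for y in range(n)) for x in range(n)]
    Q = [[mu[y] * P[y][x] / pi_t[x] for y in range(n)] for x in range(n)]
    return pi_t, Q

# Assumed toy chain with three states.
R = [[-1.5, 1.0, 0.5],
     [0.2, -0.7, 0.5],
     [0.3, 0.9, -1.2]]
mu = [0.5, 0.3, 0.2]
dt = 1e-6
pi_t, Q = one_step_posterior(mu, R, dt)
Rb = reverse_rates(pi_t, R)
# Compare (Q - I) / dt with the reverse rate matrix entry by entry.
max_err = max(
    abs((Q[x][y] - (1.0 if x == y else 0.0)) / dt - Rb[x][y])
    for x in range(3) for y in range(3)
)
print(max_err)  # expected to shrink proportionally to dt
```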
3. From the ELBO to that inequality
Substituting \(q(z \mid x)=\Pr(X_{t-\mathrm dt}=z \mid X_t=x)\) and \(p_{Z \mid X}(z \mid x)=\Pr(\hat X_{t-\mathrm dt}=z \mid \hat X_t=x)\) into (2) and expanding to first order in \(\mathrm dt\) gives
\[\begin{align*}
0 & \ge \sum_z \Pr(X_{t-\mathrm dt}=z \mid X_t=x)\log\frac{\Pr(\hat X_{t-\mathrm dt}=z \mid \hat X_t=x)}{\Pr(X_{t-\mathrm dt}=z \mid X_t=x)} \\
& =\Big(1+R_t^\gets(x, x)\,\mathrm dt\Big)\log\frac{1+\hat R_t^\gets(x, x)\,\mathrm dt}{1+R_t^\gets(x, x)\,\mathrm dt}+\sum_{z \ne x}\Big( R_t^\gets(x, z)\,\mathrm dt \Big)\log\frac{\hat R_t^\gets(x, z)\,\mathrm dt}{R_t^\gets(x, z)\,\mathrm dt} \\
& =\Bigg( \hat R_t^\gets(x, x)-R_t^\gets(x, x)+\sum_{z \ne x}R_t^\gets(x, z)\log\frac{\hat R_t^\gets(x, z)}{R_t^\gets(x, z)} \Bigg)\,\mathrm dt+o(\mathrm dt).
\end{align*}
\]
Dividing by \(\mathrm dt>0\) and letting \(\mathrm dt \to 0\) gives (3), and hence (1).
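The limit can also be observed numerically: the negative one-step KL divergence divided by \(\mathrm dt\) approaches the rate-level expression in (3). In the sketch below, `a[z]` and `b[z]` stand for the off-diagonal reverse rates \(R_t^\gets(x, z)\) and \(\hat R_t^\gets(x, z)\) out of one fixed state \(x\). The values are arbitrary and the function names are hypothetical.

```python
import math

def neg_kl_one_step(a, b, dt):
    """sum_z q(z) log(p(z)/q(z)) for the one-step reverse kernels
    q = (1 - A dt, a_z dt) and p = (1 - B dt, b_z dt)."""
    A, B = sum(a), sum(b)
    # "stay" term; log1p keeps precision when dt is tiny
    stay = (1.0 - A * dt) * (math.log1p(-B * dt) - math.log1p(-A * dt))
    # "jump" terms; dt cancels inside the logarithm
    jump = sum(az * dt * math.log(bz / az) for az, bz in zip(a, b))
    return stay + jump

def rate_level_expression(a, b):
    """R_hat_back(x,x) - R_back(x,x) + sum_z R_back(x,z) log(R_hat_back(x,z)/R_back(x,z)),
    using R_back(x,x) = -sum(a) and R_hat_back(x,x) = -sum(b)."""
    return (sum(a) - sum(b)) + sum(az * math.log(bz / az) for az, bz in zip(a, b))

# Assumed toy reverse rates out of a fixed state x.
a = [0.4, 1.1]
b = [0.9, 0.3]
dt = 1e-6
print(neg_kl_one_step(a, b, dt) / dt, rate_level_expression(a, b))
```

Both printed numbers are nonpositive, which is the statement \(0 \ge \dots\) above read at the level of rates.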