1. Background
1.1 ELBo
1.1.1 Why introduce a latent variable \(z\)?
Because the objects we observe in the real world can be thought of as arising from higher-level representations, which capture abstract attributes such as color, size, and shape.
1.1.2 How is the ELBo (Evidence Lower Bound) derived?
An unconditional generative model learns to model the true data distribution \(p\left (x\right )\), so we have:
\[\begin{align}
\log{\underbrace{p\left (x\right )}_{\text{evidence}}}
=& \log{p\left (x\right )}\int \underbrace{q_{\phi}\left (z\mid x\right )}_{\text{approximate posterior}}dz \\
=&\int q_{\phi}\left (z\mid x\right )\left (\log{p\left (x\right )}\right )dz \\
=&\mathbb{E}_{q_{\phi}\left (z\mid x\right )}\left [\log{p\left (x\right )}\right ] \\
=&\mathbb{E}_{q_{\phi}\left (z\mid x\right )}\left [\log{\frac{p\left (x,z\right )}{p\left (z\mid x\right )}}\right ] \\
=&\mathbb{E}_{q_{\phi}\left (z\mid x\right )}\left [ \log{\frac{p\left (x,z\right )q_{\phi}\left (z\mid x\right )}{p\left (z\mid x\right )q_{\phi}\left (z\mid x\right )}}\right ] \\
=&\mathbb{E}_{q_{\phi}\left (z\mid x\right )}\left [\log{\frac{p\left (x, z\right )}{q_{\phi}\left (z\mid x\right )}}\right ] + \mathbb{E}_{q_{\phi}\left (z\mid x\right )}\left [\log{\frac{q_{\phi}\left (z\mid x\right )}{p\left (z\mid x\right )}}\right ] \\
=&\underbrace{\mathbb{E}_{q_{\phi}\left (z\mid x\right )}\left [\log{\frac{p\left (x,z\right )}{q_{\phi}\left (z\mid x\right )}}\right ]}_{\text{ELBo}} + \underbrace{D_{KL}\left (\underbrace{q_{\phi}\left (z\mid x\right )}_{\text{approximate posterior}} \parallel \underbrace{p\left (z\mid x\right )}_{\text{true posterior}}\right )}_{\geq 0} \\
\geq& \underbrace{\mathbb{E}_{q_{\phi}\left (z\mid x\right )}\left [\log{\frac{p\left (x, z\right )}{q_{\phi}\left (z\mid x\right )}}\right ]}_{\text{ELBo}}
\end{align}\]
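The bound is easy to check numerically. Below is a small sanity check on a toy linear-Gaussian model where the evidence and the true posterior are available in closed form; the model choice and all names are illustrative, not from the derivation above.

```python
# Numeric sanity check of log p(x) >= ELBo on a toy tractable model:
# p(z) = N(0, 1), p(x|z) = N(z, 1), hence p(x) = N(0, 2) and the
# true posterior is p(z|x) = N(x/2, 1/2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.3  # an arbitrary observation

def elbo(q_mu, q_sigma, n_samples=200_000):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z|x)]."""
    z = rng.normal(q_mu, q_sigma, size=n_samples)
    log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)
    log_q = norm.logpdf(z, q_mu, q_sigma)
    return np.mean(log_joint - log_q)

print(f"log p(x)             = {norm.logpdf(x, 0.0, np.sqrt(2.0)):.4f}")
print(f"ELBo, mismatched q   = {elbo(0.0, 1.0):.4f}")             # strictly smaller
print(f"ELBo, true posterior = {elbo(x / 2, np.sqrt(0.5)):.4f}")  # matches log p(x)
```

When \(q\) equals the true posterior, the KL term in Equation (7) vanishes and the bound is tight.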
1.1.3 Why maximize the ELBo?
Reason 1: we want the learned approximate posterior \(q_{\phi}\left (z\mid x\right )\) to be as close as possible to the true posterior \(p\left (z\mid x\right )\), but the \(D_{KL}\) term in Equation (7) cannot be minimized directly:
\[\begin{align}
\min_{\phi}{\underbrace{D_{KL}\left (\underbrace{\underbrace{q_{\phi}\left (z\mid x\right )}_{\text{approximate posterior}} }_{\text{learnable encoder}} \parallel \underbrace{\underbrace{p\left (z\mid x\right )}_{\text{true posterior}}}_{\text{unknown}}\right )}_{\text{intractable}}}
\end{align}\]
Reason 2: for any sample \(x_i \sim p\left (x\right )\), \(p\left (x_i\right )\) is a constant, so \(\max_{\phi}{\text{ELBo}}\) is equivalent to \(\min_{\phi}{D_{KL}}\):
\[\begin{align}
\because\log{\underbrace{p\left (x_i\right )}_{\text{constant}}} =& \underbrace{\mathbb{E}_{q_{\phi}\left (z\mid x_i\right )}\left [\log{\frac{p\left (x_{i},z\right )}{q_{\phi}\left (z\mid x_i\right )}}\right ]}_{\text{ELBo}} + \underbrace{D_{KL}\left (\underbrace{q_{\phi}\left (z\mid x_i\right )}_{\text{approximate posterior}} \parallel \underbrace{p\left (z\mid x_i\right )}_{\text{true posterior}}\right )}_{\geq 0} \\
\min_{\phi}{D_{KL}} &\iff \max_{\phi}{\text{ELBo}}
\end{align}\]
2. VAE (Variational Autoencoder)
2.1 Why "Variational"?
Because the \(q_{\phi}\left (z\mid x\right )\) we optimize is restricted to a family of distributions parameterized by \(\phi\); optimizing over this family of distributions is where the name "Variational" comes from.
2.2 Why "Autoencoder"?
Because, like an autoencoder, the model compresses the data into a lower-dimensional representation and extracts the informative structure in the data.
2.3 The VAE Objective
\[\begin{align}
&\max_{\phi}\underbrace{\mathbb{E}_{q_{\phi}\left (z\mid x\right )}\left [\log{\frac{p\left (x, z\right )}{q_{\phi}\left (z\mid x\right )}}\right ]}_{\text{ELBo}} \\
=& \max_{\phi,\theta}\mathbb{E}_{q_{\phi}\left (z\mid x\right )}\left [\log{\frac{p_{\theta}\left (x\mid z\right )p\left (z\right )}{q_{\phi}\left (z\mid x\right )}}\right ] \\
=&\max_{\phi,\theta}\mathbb{E}_{q_{\phi}\left (z\mid x\right )}\left [\log{p_{\theta}\left (x\mid z\right )}\right ] + \mathbb{E}_{q_{\phi}\left (z\mid x\right )}\left [\log{\frac{p\left (z\right )}{q_{\phi}\left (z\mid x\right )}}\right ]\\
=&\max_{\phi,\theta}\underbrace{\mathbb{E}_{q_{\phi}\left (z\mid x\right )}\left [\log{\underbrace{p_{\theta}\left (x\mid z\right )}_{\text{Decoder}}}\right ]}_{\text{reconstruction term}} - \underbrace{D_{KL}\left (\underbrace{q_{\phi}\left (z\mid x\right )}_{\text{Encoder}} \parallel \underbrace{p\left (z\right )}_{\text{prior}}\right )}_{\text{prior matching term}} \\
\approx & \max_{\phi,\theta}\frac{1}{L}\sum_{l=1}^{L}\log{p_{\theta}\left (x\mid z^{\left (l\right )}\right )} - D_{KL}\left (\underbrace{q_{\phi}\left (z\mid x\right )}_{\mathcal{N}\left (\mu,\sigma^2\right )}\parallel \underbrace{p\left (z\right )}_{\mathcal{N}\left (0,1\right )}\right ) \quad \text{(Monte Carlo estimate, } z^{\left (l\right )} \sim q_{\phi}\left (z\mid x\right )\text{)} \\
=&\max_{\phi,\theta}\frac{1}{L}\sum_{l=1}^{L}\log{p_{\theta}\left (x\mid z^{\left (l\right )}\right )} - \frac{1}{2}\left (-\log{\sigma^2} + \mu^2 + \sigma^2 - 1\right )
\end{align}\]
The objective comprises two terms (see the sketch after this list):
- the reconstruction term forces the Decoder to learn to recover the original sample from the latent variable \(\boldsymbol{z}\);
- the prior matching term forces the Encoder to learn to map the original sample toward the prior distribution (a standard normal).
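As a concrete illustration, here is a minimal PyTorch sketch of this objective written as a loss to be minimized (hence the sign flip). The Bernoulli likelihood via binary cross-entropy, the single-sample estimate \(L=1\), the `log_var` parameterization, and the function name are all assumptions for illustration, not a fixed recipe.

```python
import torch
import torch.nn.functional as F

def vae_loss(x_hat_logits, x, mu, log_var):
    # Negative reconstruction term: -E_q[log p_theta(x|z)] under a Bernoulli
    # likelihood, estimated with a single Monte Carlo sample (L = 1).
    recon = F.binary_cross_entropy_with_logits(x_hat_logits, x, reduction="sum")
    # Prior matching term in closed form, summed over latent dimensions:
    # D_KL(N(mu, sigma^2) || N(0, 1)) = 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2)
    kl = 0.5 * torch.sum(log_var.exp() + mu.pow(2) - 1.0 - log_var)
    return recon + kl  # minimizing this maximizes the ELBo
```

Predicting \(\log{\sigma^2}\) rather than \(\sigma\) is a common numerical choice: it keeps the variance positive without a constraint.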
2.4 VAE Architecture
![VAE architecture diagram]()
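Since the figure is unavailable, here is a minimal PyTorch sketch of the architecture it would depict; all layer sizes and names are illustrative assumptions.

```python
# A minimal VAE: encoder -> (mu, log sigma^2) -> reparameterized z -> decoder.
# Layer sizes (784-400-20) are illustrative, e.g. for flattened MNIST images.
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        # Encoder q_phi(z|x): shared trunk, then heads for mu and log sigma^2
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.fc_mu = nn.Linear(h_dim, z_dim)
        self.fc_log_var = nn.Linear(h_dim, z_dim)
        # Decoder p_theta(x|z): outputs logits for a Bernoulli likelihood
        self.dec = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim)
        )

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.fc_mu(h), self.fc_log_var(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        eps = torch.randn_like(mu)
        z = mu + (0.5 * log_var).exp() * eps
        return self.dec(z), mu, log_var
```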
3. Training the VAE
During training, a batch of images is fed to the model. For each image the Encoder produces \(\mu\) and \(\sigma\), a latent variable \(\boldsymbol{z} \sim \mathcal{N}\left (\mu, \sigma^2\right )\) is sampled from them, and the Decoder reconstructs the image. The overall flow is:
\[\underbrace{x}_{x \sim p\left (x\right )} \rightarrow\underbrace{\text{Encoder}}_{q_{\phi}\left (z\mid x\right )}\rightarrow \mu,\sigma \rightarrow \underbrace{\underbrace{z\sim \mathcal{N}\left (\mu,\sigma^2\right )}_{z=\mu + \sigma \odot \epsilon,\ \epsilon \sim \mathcal{N}\left (0,I\right )}}_{\text{reparameterization trick}} \rightarrow \underbrace{\text{Decoder}}_{p_{\theta}\left (x\mid z\right )}\rightarrow\hat{x}
\]
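Putting the pieces together, a minimal training-loop sketch that reuses the illustrative `VAE` class from section 2.4 and `vae_loss` from section 2.3; the MNIST dataset, batch size, epoch count, and learning rate are all assumptions.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

data = datasets.MNIST("data", train=True, download=True,
                      transform=transforms.ToTensor())
loader = DataLoader(data, batch_size=128, shuffle=True)

model = VAE()                                      # sketched in section 2.4
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    for x, _ in loader:
        x = x.view(x.size(0), -1)                  # flatten 28x28 images to 784
        x_hat_logits, mu, log_var = model(x)       # encode, sample z, decode
        loss = vae_loss(x_hat_logits, x, mu, log_var)  # sketched in section 2.3
        opt.zero_grad()
        loss.backward()
        opt.step()
```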
The reparameterization trick makes the sampling step differentiable: \(\mu\) and \(\sigma\) become deterministic, differentiable outputs of the encoder, while the randomness is isolated in \(\epsilon\), which is treated as a constant that needs no gradient and stays out of the computation graph.
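A tiny demonstration of this point (all names are illustrative): gradients flow back to \(\mu\) and \(\sigma\) through \(z\), while \(\epsilon\) carries the randomness and receives no gradient.

```python
import torch

mu = torch.tensor([0.5], requires_grad=True)
log_sigma = torch.tensor([0.0], requires_grad=True)
eps = torch.randn(1)                    # noise: a constant outside the graph
z = mu + log_sigma.exp() * eps          # z = mu + sigma * eps
z.sum().backward()
print(mu.grad, log_sigma.grad)          # tensor([1.]) and sigma * eps
```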
4. VAE Inference
At inference time, a new sample can be generated simply by drawing a latent variable \(z\) from the standard normal distribution, because the prior matching term in the VAE objective pushes \(q_{\phi}\left (z\mid x\right )\) toward the standard normal. The overall flow is:
\[\underbrace{z}_{z \sim \mathcal{N}\left (0,I\right )} \rightarrow \underbrace{\text{Decoder}}_{p_{\theta}\left (x \mid z\right )} \rightarrow \underbrace{\hat{x}}_{\text{new sample}}
\]
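A minimal sampling sketch under the same illustrative sizes as the architecture in section 2.4; the decoder is randomly initialized here, whereas in practice its weights would come from training.

```python
import torch
import torch.nn as nn

z_dim, x_dim = 20, 784
decoder = nn.Sequential(nn.Linear(z_dim, 400), nn.ReLU(), nn.Linear(400, x_dim))

with torch.no_grad():
    z = torch.randn(16, z_dim)           # z ~ N(0, I): 16 latent samples
    x_hat = torch.sigmoid(decoder(z))    # decode logits into pixel values
print(x_hat.shape)                       # torch.Size([16, 784])
```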