Diffusion Models (Part 1): DDPM
The Forward Diffusion Process
The forward diffusion process gradually corrupts the original data \(x_0\) over \(T\) steps, adding Gaussian noise at each step according to a variance schedule \(\beta_t\in(0,1)\), until \(x_T\) approximately follows a standard normal distribution.
This process can be written as:
\[q(x_t|x_{t-1})=\mathcal{N}(x_t;\sqrt{1-\beta_t}x_{t-1},\beta_t\mathbf{I})
\]
or equivalently:
\[x_t=\sqrt{1-\beta_t}x_{t-1}+\sqrt{\beta_t}\epsilon_t,\,\epsilon_t\sim\mathcal{N}(0,\mathbf{I})
\]
Defining \(\alpha_t=1-\beta_t\) and \(\overline{\alpha}_t=\prod_{s=1}^t\alpha_s\), the additivity of independent Gaussians lets us obtain \(x_t\) directly from \(x_0\):
\[q(x_t|x_{0})=\mathcal{N}(x_t;\sqrt{\overline{\alpha}_t}x_{0},(1-\overline{\alpha}_t)\mathbf{I})
\]
or equivalently:
\[x_t=\sqrt{\overline{\alpha}_t}x_0+\sqrt{1-\overline{\alpha}_t}\overline{\epsilon}_t,\,\overline{\epsilon}_t\sim\mathcal{N}(0,\mathbf{I})
\]
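As a concrete reference, here is a minimal sketch of this closed-form forward sampling. PyTorch and all names (`q_sample`, `betas`, `alphas_bar`) are our own assumptions (the post names no framework); the linear \(\beta\) schedule from \(10^{-4}\) to \(0.02\) with \(T=1000\) follows the DDPM paper.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear beta schedule (Ho et al., 2020)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t|x_0): x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps.

    t is zero-indexed, so code t=0 corresponds to step t=1 in the math.
    """
    abar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over batch
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps
```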
The Reverse Process
The reverse process starts from pure noise \(x_T\sim\mathcal{N}(0,\mathbf{I})\) and denoises it step by step to produce a realistic sample. It is modeled by a neural network:
\[p_{\theta}(x_{0:T})=p(x_T)\prod_{t=1}^Tp_{\theta}(x_{t-1}|x_t)
\]
\[p_{\theta}(x_{t-1}|x_t)=\mathcal{N}(x_{t-1};\mu_{\theta}(x_t,t),\Sigma_{\theta}(x_t,t))
\]
DDPM simplifies \(p_{\theta}(x_{t-1}|x_t)\) by fixing the variance to an untrained constant \(\sigma_t^2\) (set to \(\beta_t\) or \(\tilde{\beta}_t\) in the paper):
\[p_{\theta}(x_{t-1}|x_t)=\mathcal{N}(x_{t-1};\mu_{\theta}(x_t,t),\sigma_t^2\mathbf{I})
\]
The Training Objective
To fit the data distribution, we use maximum likelihood estimation (MLE): we maximize the likelihood that our modeled distribution assigns to samples drawn from the true data distribution:
\[\mathbb{E}_{x\sim p_{data}(x)}\left[\log p_{\theta}(x)\right]
\]
Via variational inference, DDPM derives a variational lower bound (VLB) on the log-likelihood and uses it as the optimization objective:
\[\begin{align*}
\log{p_{\theta}(x_0)} &= \log{\int{p_{\theta}(x_{0:T})dx_{1:T}}} \\
&= \log{\int{\frac{p_{\theta}(x_{0:T})q(x_{1:T}|x_0)}{q(x_{1:T}|x_0)}}dx_{1:T}} \\
&= \log{\mathbb{E}_{q(x_{1:T}|x_0)}\left[\frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)}\right]} \\
&\geq \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log{\frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)}}\right]
\end{align*}
\]
The last step uses Jensen's inequality: for a concave function \(f\), \(f(\mathbb{E}[x])\geq\mathbb{E}[f(x)]\). Negating the VLB gives the training loss:
\[\begin{align*}
L=-L_{VLB}&=\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log{\frac{q(x_{1:T}|x_0)}{p_{\theta}(x_{0:T})}}\right] \\
&= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log{\frac{q(x_T|x_0)\prod_{t=2}^Tq(x_{t-1}|x_t,x_0)}{p_{\theta}(x_T)\prod_{t=1}^Tp_{\theta}(x_{t-1}|x_t)}}\right] \\
&= \mathbb{E}_{q(x_T|x_0)}\left[\log\frac{q(x_T|x_0)}{p_{\theta}(x_T)}\right]+\sum_{t=2}^T\mathbb{E}_{q(x_{t-1},x_t|x_0)}\left[\log\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}\right]-\mathbb{E}_{q(x_1|x_0)}\left[\log{p_{\theta}(x_0|x_1)}\right] \\
&= \int{q(x_T|x_0)}\log\frac{q(x_T|x_0)}{p_{\theta}(x_T)}dx_T+\sum_{t=2}^T\int{q(x_t|x_0)q(x_{t-1}|x_t,x_0)}\log\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}dx_{t-1}dx_t-\mathbb{E}_{q(x_1|x_0)}\left[\log{p_{\theta}(x_0|x_1)}\right] \\
&= D_{KL}(q(x_T|x_0)||p_{\theta}(x_T))+\sum_{t=2}^T\mathbb{E}_{q(x_t|x_0)}\left[\int{q(x_{t-1}|x_t,x_0)}\log\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}dx_{t-1}\right]-\mathbb{E}_{q(x_1|x_0)}\left[\log{p_{\theta}(x_0|x_1)}\right] \\
&= D_{KL}(q(x_T|x_0)||p_{\theta}(x_T))+\sum_{t=2}^T\mathbb{E}_{q(x_t|x_0)}\left[D_{KL}(q(x_{t-1}|x_t,x_0)||p_{\theta}(x_{t-1}|x_t))\right]-\mathbb{E}_{q(x_1|x_0)}\left[\log{p_{\theta}(x_0|x_1)}\right]
\end{align*}
\]
The first term is the KL divergence between two (approximately) standard normal distributions; it contains no trainable parameters and is close to zero. The third term can be viewed as reconstructing the original data from \(x_1\) and is not discussed further here.
This leaves the following training objective for the network at each step \(t\):
\[\begin{align*}
\tilde{\theta} &= \arg\min_{\theta}\mathbb{E}_{q(x_t|x_0)}\left[D_{KL}(q(x_{t-1}|x_t,x_0)||p_{\theta}(x_{t-1}|x_t))\right] \\
&= \arg\min_{\theta}\mathbb{E}_{q(x_0)}\left[\mathbb{E}_{q(x_t|x_0)}\left[D_{KL}(q(x_{t-1}|x_t,x_0)||p_{\theta}(x_{t-1}|x_t))\right]\right] \\
&= \arg\min_{\theta}\int{q(x_t|x_0)q(x_0)D_{KL}(q(x_{t-1}|x_t,x_0)||p_{\theta}(x_{t-1}|x_t))dx_0dx_t} \\
&= \arg\min_{\theta}\int{q(x_0,x_t)D_{KL}(q(x_{t-1}|x_t,x_0)||p_{\theta}(x_{t-1}|x_t))dx_0dx_t} \\
&= \arg\min_{\theta}\mathbb{E}_{q(x_0,x_t)}\left[D_{KL}(q(x_{t-1}|x_t,x_0)||p_{\theta}(x_{t-1}|x_t))\right]
\end{align*}
\]
Since both Gaussians share the same fixed isotropic covariance, the closed-form KL divergence between two Gaussians reduces the training objective to a weighted squared distance between the means:
\[\begin{align*}
\mathbb{E}_{q(x_0,x_t)}\left[D_{KL}(q(x_{t-1}|x_t,x_0)||p_{\theta}(x_{t-1}|x_t))\right]&=\mathbb{E}_{q(x_0,x_t)}\left[\frac{1}{2\sigma_t^2}||\tilde{\mu}_t(x_t,x_0)-\mu_{\theta}(x_t,t)||^2\right] \\
&=\mathbb{E}_{x_0,\epsilon\sim\mathcal{N}(0,\mathbf{I})}\left[\frac{1}{2\sigma_t^2}||\tilde{\mu}_t(x_t(x_0,\epsilon),x_0)-\mu_{\theta}(x_t(x_0,\epsilon),t)||^2\right]
\end{align*}
\]
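The simplification above uses the standard identity for the KL divergence between Gaussians with identical isotropic covariance, in which every term except the squared distance between means cancels:
\[D_{KL}\left(\mathcal{N}(\mu_1,\sigma^2\mathbf{I})\,||\,\mathcal{N}(\mu_2,\sigma^2\mathbf{I})\right)=\frac{1}{2\sigma^2}||\mu_1-\mu_2||^2
\]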
By Bayes' theorem, the forward-process posterior is itself Gaussian:
\[\begin{align*}
q(x_{t-1}|x_t,x_0) &= \frac{q(x_t|x_{t-1},x_0)q(x_{t-1}|x_0)}{q(x_t|x_0)} \\
&= \mathcal{N}(x_{t-1};\tilde{\mu}_t(x_t,x_0),\tilde{\beta}_t\mathbf{I})
\end{align*}
\]
Since the forward process is a Markov chain, \(q(x_t|x_{t-1},x_0)=q(x_t|x_{t-1})\). Substituting the three Gaussian densities and completing the square in \(x_{t-1}\) gives:
\[\tilde{\mu}_t(x_t,x_0)=\frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_t}x_t+\frac{\sqrt{\overline{\alpha}_{t-1}}\beta_t}{1-\overline{\alpha}_t}x_0
\]
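For completeness, matching the coefficient of \(x_{t-1}^2\) in the exponent of the same product also yields the posterior variance:
\[\frac{1}{\tilde{\beta}_t}=\frac{\alpha_t}{\beta_t}+\frac{1}{1-\overline{\alpha}_{t-1}}\quad\Rightarrow\quad\tilde{\beta}_t=\frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\beta_t
\]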
Substituting \(x_0=\frac{1}{\sqrt{\overline{\alpha}_t}}x_t-\frac{\sqrt{1-\overline{\alpha}_t}}{\sqrt{\overline{\alpha}_t}}\overline{\epsilon}_t\) (the closed-form forward equation rearranged) into the expression above and simplifying yields:
\[\tilde{\mu}_t(x_t,x_0)=\frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\overline{\epsilon}_t)
\]
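To verify this, note that \(\sqrt{\overline{\alpha}_{t-1}}/\sqrt{\overline{\alpha}_t}=1/\sqrt{\alpha_t}\) and \(\alpha_t(1-\overline{\alpha}_{t-1})+\beta_t=1-\overline{\alpha}_t\):
\[\begin{align*}
\tilde{\mu}_t(x_t,x_0) &= \frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_t}x_t+\frac{\beta_t}{\sqrt{\alpha_t}(1-\overline{\alpha}_t)}\left(x_t-\sqrt{1-\overline{\alpha}_t}\,\overline{\epsilon}_t\right) \\
&= \frac{\alpha_t(1-\overline{\alpha}_{t-1})+\beta_t}{\sqrt{\alpha_t}(1-\overline{\alpha}_t)}x_t-\frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\overline{\alpha}_t}}\overline{\epsilon}_t \\
&= \frac{1}{\sqrt{\alpha_t}}\left(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\overline{\epsilon}_t\right)
\end{align*}
\]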
Combining this with the simplified training objective above, we reparameterize \(\mu_{\theta}(x_t,t)\) as:
\[\begin{align*}
\mu_{\theta}(x_t,t) &= \frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon_{\theta}(x_t,t)) \\
&= \frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon_{\theta}(\sqrt{\overline{\alpha}_t}x_0+\sqrt{1-\overline{\alpha}_t}\overline{\epsilon}_t,t))
\end{align*}
\]
In other words, the neural network \(\epsilon_{\theta}\) is trained to predict the noise \(\overline{\epsilon}_t\) that was added to the original data \(x_0\) to produce the noisy data \(x_t\). Simplifying and dropping the weighting coefficient, we obtain the final loss:
\[L=\mathbb{E}_{t,x_0,\epsilon\sim\mathcal{N}(0,\mathbf{I})}\left[||\epsilon-\epsilon_{\theta}(\sqrt{\overline{\alpha}_t}x_0+\sqrt{1-\overline{\alpha}_t}\epsilon,t)||^2\right]
\]
where \(t\) is sampled uniformly from \(\{1,\dots,T\}\).
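A minimal training-step sketch under this loss, reusing `q_sample`, `alphas_bar`, and `T` from the earlier snippet; `model` stands for any noise-prediction network \(\epsilon_{\theta}(x_t,t)\) that returns a tensor shaped like its input (all names here are assumptions, not a fixed API):

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """Simplified DDPM objective: MSE between true and predicted noise."""
    # Assumes betas/alphas_bar live on the same device as x0.
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # uniform timestep
    eps = torch.randn_like(x0)                                 # true noise
    x_t = q_sample(x0, t, eps)                                 # closed-form forward sample
    return F.mse_loss(model(x_t, t), eps)                      # ||eps - eps_theta(x_t, t)||^2
```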
Once the network is trained, each reverse-process sampling step draws:
\[x_{t-1}=\frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon_{\theta}(x_t,t))+\sigma_tz,\,z\sim\mathcal{N}(0,\mathbf{I})
\]
with \(z=0\) at the final step \(t=1\).
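And a matching sampling-loop sketch under the same assumptions, with \(\sigma_t^2=\beta_t\) (one of the two fixed choices in the DDPM paper) and the noise \(z\) dropped at the final step:

```python
@torch.no_grad()
def p_sample_loop(model, shape) -> torch.Tensor:
    """Ancestral sampling: start from pure noise and denoise for T steps."""
    x = torch.randn(shape)                                    # x_T ~ N(0, I)
    for t in reversed(range(T)):                              # code t=0 is math t=1
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_hat = model(x, t_batch)                           # predicted noise
        coef = betas[t] / (1.0 - alphas_bar[t]).sqrt()
        mean = (x - coef * eps_hat) / alphas[t].sqrt()        # mu_theta(x_t, t)
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)  # sigma_t = sqrt(beta_t)
        else:
            x = mean                                          # z = 0 at the last step
    return x
```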