Diffusion Models (Part 1): DDPM

The Forward Diffusion Process

The forward diffusion process gradually adds Gaussian noise to the original data \(x_0\) over \(T\) steps, according to a variance schedule \(\beta_t\in(0,1)\), until \(x_T\) approximately follows a standard normal distribution.
Each step can be written as:

\[q(x_t|x_{t-1})=\mathcal{N}(x_t;\sqrt{1-\beta_t}x_{t-1},\beta_t\mathbf{I}) \]

or equivalently:

\[x_t=\sqrt{1-\beta_t}x_{t-1}+\sqrt{\beta_t}\epsilon_t,\,\epsilon_t\sim\mathcal{N}(0,\mathbf{I}) \]

Defining \(\alpha_t=1-\beta_t\) and \(\overline{\alpha}_t=\prod_{s=1}^{t}\alpha_s\), and using the fact that a sum of independent Gaussians is again Gaussian, \(x_t\) can be sampled directly from \(x_0\):

\[q(x_t|x_{0})=\mathcal{N}(x_t;\sqrt{\overline{\alpha}_t}x_{0},(1-\overline{\alpha}_t)\mathbf{I}) \]

or equivalently:

\[x_t=\sqrt{\overline{\alpha}_t}x_0+\sqrt{1-\overline{\alpha}_t}\overline{\epsilon_t},\,\overline{\epsilon_t}\sim\mathcal{N}(0,\mathbf{I}) \]
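
As a quick illustration, here is a minimal NumPy sketch of sampling \(x_t\) directly from \(x_0\) with the closed form above. The linear \(\beta\) schedule and the values `T=1000`, `1e-4`, `0.02` follow the common DDPM setup and are only illustrative; the helper name `q_sample` is made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear beta schedule (the usual DDPM choice).
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # beta_1 ... beta_T
alphas = 1.0 - betas                     # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)          # bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) in one shot using the closed form (t is 1-indexed)."""
    eps = rng.standard_normal(x0.shape)
    ab = alpha_bars[t - 1]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

x0 = rng.standard_normal((4, 8))         # toy "data"
x_mid = q_sample(x0, T // 2)             # heavily noised
x_T = q_sample(x0, T)                    # statistics close to N(0, I)
```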

The Reverse Process

The reverse process starts from random noise \(x_T\sim\mathcal{N}(0,\mathbf{I})\) and denoises it step by step to obtain a realistic sample. The process is modeled by a neural network:

\[p_{\theta}(x_{0:T})=p(x_T)\prod_{t=1}^Tp_{\theta}(x_{t-1}|x_t) \]

\[p_{\theta}(x_{t-1}|x_t)=\mathcal{N}(x_{t-1};\mu_{\theta}(x_t,t),\Sigma_{\theta}(x_t,t)) \]

DDPM simplifies \(p_{\theta}(x_{t-1}|x_t)\) by using a fixed (non-learned) variance:

\[p_{\theta}(x_{t-1}|x_t)=\mathcal{N}(x_{t-1};\mu_{\theta}(x_t,t),\sigma_t^2\mathbf{I}) \]

Training Objective

To fit the data distribution, we use maximum likelihood estimation (MLE): the model should assign the highest possible log-probability to samples drawn from the true data distribution, i.e. we maximize

\[\mathbb{E}_{x\sim p_{data}(x)}\left[\log p_{\theta}(x)\right] \]

DDPM instead optimizes a variational lower bound (VLB) on the log-likelihood, obtained through variational inference:

\[\begin{align*} \log{p_{\theta}(x_0)} &= \log{\int{p_{\theta}(x_{0:T})dx_{1:T}}} \\ &= \log{\int{\frac{p_{\theta}(x_{0:T})q(x_{1:T}|x_0)}{q(x_{1:T}|x_0)}}dx_{1:T}} \\ &= \log{\mathbb{E}_{q(x_{1:T}|x_0)}\left[\frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)}\right]} \\ &\geq \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log{\frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)}}\right] \end{align*} \]

The last step of this derivation uses Jensen's inequality: for a concave function \(f\), \(f(\mathbb{E}[x])\geq\mathbb{E}[f(x)]\). The training loss is the negative VLB, which can be decomposed as:

\[\begin{align*} L=-L_{VLB}&=\mathbb{E}_{q(x_{1:T}|x_0)}\left[\log{\frac{q(x_{1:T}|x_0)}{p_{\theta}(x_{0:T})}}\right] \\ &= \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log{\frac{q(x_T|x_0)\prod_{t=2}^Tq(x_{t-1}|x_t,x_0)}{p(x_T)\prod_{t=1}^Tp_{\theta}(x_{t-1}|x_t)}}\right] \\ &= \mathbb{E}_{q(x_T|x_0)}\left[\log\frac{q(x_T|x_0)}{p(x_T)}\right]+\sum_{t=2}^T\mathbb{E}_{q(x_{t-1},x_t|x_0)}\left[\log\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}\right]-\mathbb{E}_{q(x_1|x_0)}\left[\log{p_{\theta}(x_0|x_1)}\right] \\ &= \int{q(x_T|x_0)}\log\frac{q(x_T|x_0)}{p(x_T)}dx_T+\sum_{t=2}^T\int{q(x_t|x_0)q(x_{t-1}|x_t,x_0)}\log\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}dx_{t-1}dx_t-\mathbb{E}_{q(x_1|x_0)}\left[\log{p_{\theta}(x_0|x_1)}\right] \\ &= D_{KL}(q(x_T|x_0)||p(x_T))+\sum_{t=2}^T\mathbb{E}_{q(x_t|x_0)}\left[\int{q(x_{t-1}|x_t,x_0)}\log\frac{q(x_{t-1}|x_t,x_0)}{p_{\theta}(x_{t-1}|x_t)}dx_{t-1}\right]-\mathbb{E}_{q(x_1|x_0)}\left[\log{p_{\theta}(x_0|x_1)}\right] \\ &= D_{KL}(q(x_T|x_0)||p(x_T))+\sum_{t=2}^T\mathbb{E}_{q(x_t|x_0)}\left[D_{KL}(q(x_{t-1}|x_t,x_0)||p_{\theta}(x_{t-1}|x_t))\right]-\mathbb{E}_{q(x_1|x_0)}\left[\log{p_{\theta}(x_0|x_1)}\right] \end{align*} \]

The first term is the KL divergence between \(q(x_T|x_0)\), which is close to a standard normal, and the fixed prior \(p(x_T)=\mathcal{N}(0,\mathbf{I})\); it contains no trainable parameters and is approximately zero. The third term can be viewed as a reconstruction term for the original data and is not discussed further here.

From this, for each timestep \(t\) the network's training objective becomes (in practice \(t\) is sampled uniformly and the per-step terms are averaged):

\[\begin{align*} \tilde{\theta} &= argmin_{\theta}\mathbb{E}_{q(x_0)}\left[\mathbb{E}_{q(x_t|x_0)}\left[D_{KL}(q(x_{t-1}|x_t,x_0)||p_{\theta}(x_{t-1}|x_t))\right]\right] \\ &= argmin_{\theta}\int{q(x_0)q(x_t|x_0)D_{KL}(q(x_{t-1}|x_t,x_0)||p_{\theta}(x_{t-1}|x_t))dx_0dx_t} \\ &= argmin_{\theta}\mathbb{E}_{q(x_0,x_t)}\left[D_{KL}(q(x_{t-1}|x_t,x_0)||p_{\theta}(x_{t-1}|x_t))\right] \end{align*} \]

Using the closed-form KL divergence between two Gaussians, and dropping coefficients that do not depend on \(\theta\), the training objective simplifies to:

\[\begin{align*} \mathbb{E}_{q(x_0,x_t)}\left[D_{KL}(q(x_{t-1}|x_t,x_0)||p_{\theta}(x_{t-1}|x_t))\right]&\propto\mathbb{E}_{q(x_0,x_t)}\left[||\tilde{\mu}_t(x_t,x_0)-\mu_{\theta}(x_t,t)||^2\right] \\ &=\mathbb{E}_{x_0,\epsilon\sim\mathcal{N}(0,\mathbf{I})}\left[||\tilde{\mu}_t(x_t(x_0,\epsilon),x_0)-\mu_{\theta}(x_t(x_0,\epsilon),t)||^2\right] \end{align*} \]
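
To see why the KL collapses to a squared distance between means, here is a small NumPy sanity check (illustrative values only): for two isotropic Gaussians with the same variance, the closed form \(D_{KL}=\frac{||\mu_1-\mu_2||^2}{2\sigma^2}\) matches a Monte-Carlo estimate. The helper names are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def kl_closed_form(mu1, mu2, sigma2):
    """KL( N(mu1, sigma2*I) || N(mu2, sigma2*I) ) = ||mu1 - mu2||^2 / (2 * sigma2)."""
    return float(np.sum((mu1 - mu2) ** 2) / (2.0 * sigma2))

def kl_monte_carlo(mu1, mu2, sigma2, n=200_000):
    """Estimate E_{x ~ N(mu1, sigma2*I)}[ log N(x; mu1, ...) - log N(x; mu2, ...) ]."""
    x = mu1 + np.sqrt(sigma2) * rng.standard_normal((n, mu1.size))
    log_ratio = (np.sum((x - mu2) ** 2, axis=1) - np.sum((x - mu1) ** 2, axis=1)) / (2.0 * sigma2)
    return float(log_ratio.mean())

mu1, mu2, sigma2 = rng.standard_normal(5), rng.standard_normal(5), 0.3
print(kl_closed_form(mu1, mu2, sigma2))   # closed form
print(kl_monte_carlo(mu1, mu2, sigma2))   # should agree to roughly two decimal places
```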

By Bayes' theorem, the forward-process posterior \(q(x_{t-1}|x_t,x_0)\) is also a Gaussian:

\[\begin{align*} q(x_{t-1}|x_t,x_0) &= \frac{q(x_t|x_{t-1},x_0)q(x_{t-1}|x_0)}{q(x_t|x_0)} \\ &= \mathcal{N}(x_{t-1};\tilde{\mu}_t(x_t,x_0),\tilde{\beta}_t\mathbf{I}) \end{align*} \]

Since the forward process is a Markov chain, \(q(x_t|x_{t-1},x_0)=q(x_t|x_{t-1})\). Substituting the three Gaussian densities and completing the square gives \(\tilde{\beta}_t=\frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\beta_t\) and:

\[\tilde{\mu}_t(x_t,x_0)=\frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_t}x_t+\frac{\sqrt{\overline{\alpha}_{t-1}}\beta_t}{1-\overline{\alpha}_t}x_0 \]

Substituting \(x_0=\frac{1}{\sqrt{\overline{\alpha}_t}}x_t-\frac{\sqrt{1-\overline{\alpha}_t}}{\sqrt{\overline{\alpha}_t}}\overline{\epsilon}_t\) into the expression above and simplifying gives:

\[\tilde{\mu}_t(x_t,x_0)=\frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\overline{\epsilon}_t) \]
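
A quick NumPy check (with made-up schedule values for a single timestep) that the two expressions for \(\tilde{\mu}_t\) agree once \(x_t\) is built from \(x_0\) and \(\overline{\epsilon}_t\):

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up schedule values for one timestep t.
beta_t = 0.02
alpha_t = 1.0 - beta_t
alpha_bar_prev = 0.5                         # bar{alpha}_{t-1}
alpha_bar_t = alpha_bar_prev * alpha_t       # bar{alpha}_t = bar{alpha}_{t-1} * alpha_t

x0 = rng.standard_normal(8)
eps_bar = rng.standard_normal(8)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps_bar

# Posterior mean written in terms of (x_t, x_0).
mu_from_x0 = (np.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)) * x_t \
           + (np.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)) * x0

# Posterior mean written in terms of (x_t, eps_bar) after substituting x_0.
mu_from_eps = (x_t - beta_t / np.sqrt(1.0 - alpha_bar_t) * eps_bar) / np.sqrt(alpha_t)

print(np.allclose(mu_from_x0, mu_from_eps))  # True
```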

Combining this with the simplified training objective above, we reparameterize \(\mu_{\theta}(x_t,t)\) as:

\[\begin{align*} \mu_{\theta}(x_t,t) &= \frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon_{\theta}(x_t,t)) \\ &= \frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon_{\theta}(\sqrt{\overline{\alpha}_t}x_0+\sqrt{1-\overline{\alpha}_t}\overline{\epsilon_t},t)) \end{align*} \]

In other words, the neural network \(\epsilon_{\theta}\) is trained to predict the noise \(\overline{\epsilon}_t\) that was added to the original data \(x_0\) to produce the noisy data \(x_t\). Simplifying and dropping the weighting coefficient, we obtain the final loss:

\[L=\mathbb{E}_{t,x_0,\epsilon\sim\mathcal{N}(0,\mathbf{I})}\left[||\epsilon-\epsilon_{\theta}(\sqrt{\overline{\alpha}_t}x_0+\sqrt{1-\overline{\alpha}_t}\epsilon,t)||^2\right] \]
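
This loss corresponds to a very simple training step. The sketch below is a minimal PyTorch version, assuming `eps_model(x_t, t)` is some noise-prediction network (a placeholder name, not a specific library API) that takes a batch of noisy samples and integer timesteps and returns a tensor shaped like its input; the schedule mirrors the NumPy one above.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)              # bar{alpha}_1 ... bar{alpha}_T

def ddpm_loss(eps_model, x0):
    """Simplified DDPM loss: MSE between the true noise and the predicted noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                           # one uniform timestep per sample
    ab = alpha_bars[t].view(b, *([1] * (x0.dim() - 1)))     # reshape for broadcasting
    eps = torch.randn_like(x0)                              # epsilon ~ N(0, I)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps          # closed-form forward sample
    return torch.mean((eps - eps_model(x_t, t)) ** 2)       # ||eps - eps_theta(x_t, t)||^2
```

In a real training loop this value would simply be backpropagated through `eps_model` with a standard optimizer.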

Once the network is trained, each step of the reverse sampling process is:

\[x_{t-1}=\frac{1}{\sqrt{\alpha_t}}\left(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon_{\theta}(x_t,t)\right)+\sigma_tz,\quad z\sim\mathcal{N}(0,\mathbf{I}) \]
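
A matching sketch of the reverse sampling loop, using the same assumed `eps_model` and schedule as above; following DDPM, \(\sigma_t^2\) is fixed to \(\beta_t\) here and the noise \(z\) is dropped at the final step.

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_model, shape, betas):
    """Ancestral sampling: start from x_T ~ N(0, I) and denoise step by step."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                      # x_T ~ N(0, I)
    for t in range(len(betas), 0, -1):                          # t = T, ..., 1
        beta_t, alpha_t, ab_t = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
        t_batch = torch.full((shape[0],), t - 1, dtype=torch.long)
        eps_hat = eps_model(x, t_batch)                         # predicted noise eps_theta(x_t, t)
        mean = (x - beta_t / (1.0 - ab_t).sqrt() * eps_hat) / alpha_t.sqrt()
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + beta_t.sqrt() * z                            # sigma_t^2 = beta_t (fixed variance)
    return x                                                    # approximate sample from the data distribution
```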
