(Paper Reading) Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization

1. Paper

Title: Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization
Code: https://github.com/CrossmodalGroup/DynamicVectorQuantization
Venue: CVPR 2023
Abstract:

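Earlier VQ methods typically split an image into fixed-size regions and represent every region with a code of fixed length. The authors propose Dynamic Vector Quantization, whose core idea is to partition the image into regions of different granularity according to information density: coarse-grained (smooth) regions receive fewer codes, and fine-grained (detailed) regions receive more codes. In addition, unlike the sequential generation order used in earlier work, the paper proposes generating the image from coarse-grained regions to fine-grained regions.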

2. Vector Quantization

First, define the codebook:

\[\mathcal{C}:=\{ (k, e(k)) \}_{k \in [K]}, \quad e(k) \in \mathbb{R}^{n_z}, \]

where \(K\) is the codebook size and \(n_z\) is the dimension of each code vector.

(a) Feature extraction

Given an input image \(\mathbf{X} \in \mathbb{R}^{H_0 \times W_0 \times 3}\), an encoder \(E\) extracts region features:

\[\mathbf{Z} = E(\mathbf{X}) \in \mathbb{R}^{H \times W \times n_z}, \]

where \((H,W)=(H_0/f, W_0/f)\) and \(f\) is the downsampling factor.

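(b) Vector quantization

Next, the feature map \(\mathbf{Z}\) is vector-quantized. Every feature vector \(z \in \mathbb{R}^{n_z}\) in \(\mathbf{Z}\) is passed through a quantization operator \(\mathcal{Q}(\cdot)\):

\[\mathcal{Q}(z,\mathcal{C}) = \arg\min_{k \in [K]} \| z - e(k)\|_2^2, \]

where \(\mathcal{Q}(z,\mathcal{C})\) is the quantized code (an index into the codebook) and \(z^q = e(\mathcal{Q}(z,\mathcal{C}))\) is the quantized vector. Applying this to every position yields the quantized feature map \(Z^q \in \mathbb{R}^{H \times W \times n_z}\).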
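
The nearest-neighbor lookup above can be sketched in a few lines of NumPy. This is only an illustration of the argmin rule, not the authors' implementation; the function name `quantize` and the toy sizes (8 codes of dimension 4, a 2 x 3 feature grid) are assumptions made for the example.

```python
import numpy as np

def quantize(Z, codebook):
    """Map each feature vector in Z (H, W, n_z) to its nearest codebook entry.

    codebook has shape (K, n_z); row k plays the role of e(k).
    Returns (indices, Z_q) with shapes (H, W) and (H, W, n_z).
    """
    H, W, n_z = Z.shape
    flat = Z.reshape(-1, n_z)                              # (H*W, n_z)
    # Squared Euclidean distance between every feature and every code.
    diff = flat[:, None, :] - codebook[None, :, :]         # (H*W, K, n_z)
    d2 = (diff ** 2).sum(axis=-1)                          # (H*W, K)
    idx = d2.argmin(axis=1)                                # Q(z, C) per position
    Z_q = codebook[idx].reshape(H, W, n_z)                 # e(Q(z, C)) per position
    return idx.reshape(H, W), Z_q

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # K = 8, n_z = 4
Z = rng.normal(size=(2, 3, 4))       # H = 2, W = 3
indices, Z_q = quantize(Z, codebook)
```

Here `indices` holds one integer code per spatial position, and `Z_q` is the feature map with every vector replaced by its codebook entry.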

(c) Reconstruction training

A decoder \(D\) then reconstructs the input image from the quantized features:

\[\bar X = D(Z^q). \]

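Limitation: every code represents an image patch of fixed size \(f \times f\), and every image region is represented by a code of the same length, so differences in information density between regions are ignored.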

3. Proposed Method

3.1 Stage 1: Dynamic-Quantization VAE (DQ-VAE)

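(1) Region partitioning:

Define a set of downsampling factors. Here \(K\) denotes the number of granularities, not the codebook size from Section 2, and \(K = 2\) in the experiments: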

\[F=\{ f_1, f_2, \cdots, f_K\}, \quad f_1 < f_2 < \cdots <f_K, \]

A hierarchical encoder \(E_h\) then encodes the image into the corresponding hierarchical features:

\[\mathbf{Z} = \{ \mathbf{Z}_1, \mathbf{Z}_2, \cdots,\mathbf{Z}_K\} \]

where \(\mathbf{Z}_i \in \mathbb{R}^{H_i \times W_i \times n_z}\) and \((H_i, W_i) = (H_0 / f_i, W_0/f_i)\) for all \(i \in \{1, 2, \cdots, K\}\).
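
As a quick check of the shape arithmetic, the sketch below computes the grid size \((H_i, W_i)\) for each downsampling factor. The helper `feature_shapes`, the 256 x 256 input, and the factor set {8, 16} are illustrative assumptions, not values taken from these notes.

```python
def feature_shapes(h0, w0, factors):
    """Return the (H_i, W_i) grid size for each downsampling factor f_i."""
    shapes = []
    for f in factors:
        if h0 % f or w0 % f:
            raise ValueError(f"image size ({h0}, {w0}) is not divisible by f={f}")
        shapes.append((h0 // f, w0 // f))
    return shapes

# Assumed example: a 256x256 image with two granularities (K = 2).
shapes = feature_shapes(256, 256, [8, 16])
print(shapes)  # [(32, 32), (16, 16)]
```

The smaller factor gives the finer grid, so a region of the coarsest grid covers \((f_K / f_1)^2\) positions of the finest one (4 in this example).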

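Remark 1: To match the encoder, nearest-neighbor replication is used so that the feature maps at different levels contain the same number of features.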
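
Nearest-neighbor replication simply repeats each feature vector over a block of positions. The sketch below shows the operation on a toy map; which maps are replicated, and to which resolution, is an assumption here (a coarse map brought to a grid twice as fine), and `replicate_nearest` is a hypothetical helper name.

```python
import numpy as np

def replicate_nearest(Z_coarse, factor):
    """Repeat every spatial position factor x factor times (nearest-neighbor upsampling)."""
    rows = np.repeat(Z_coarse, factor, axis=0)   # repeat along height
    return np.repeat(rows, factor, axis=1)       # repeat along width

Z_coarse = np.arange(4, dtype=float).reshape(2, 2, 1)   # 2x2 grid, n_z = 1
Z_up = replicate_nearest(Z_coarse, 2)                   # 4x4 grid, n_z = 1
```

After replication, each original value fills a 2 x 2 block of `Z_up`, so the coarse map has as many positions as a map at the finer resolution.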

**Dynamic Grained Coding (DGC) module:**

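First, gating logits are computed for every image region from the pooled and normalized hierarchical features \(\mathbf{Z}' = \{ \mathbf{Z}_1', \mathbf{Z}_2', \cdots, \mathbf{Z}_K'\}\), where \(\|\) denotes concatenation along the channel dimension and \(W_g\) is a learnable weight matrix: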

\[G = (\mathbf{Z}_1'\| \mathbf{Z}_2'\|\cdots \| \mathbf{Z}_K') W_g \in \mathbb{R}^{H_s \times W_s \times K}, \quad (H_s,W_s) = (H_0 / f_K, W_0 / f_K). \]
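
A minimal NumPy sketch of this gating step follows, under stated assumptions: average pooling brings every level to the coarsest grid \((H_s, W_s)\), the normalization is a simple per-level standardization used as a stand-in (the notes do not say which normalization is used), and \(W_g\) is a random matrix in place of learned weights. The helper names `avg_pool` and `gating_logits` are hypothetical.

```python
import numpy as np

def avg_pool(Z, window):
    """Average-pool an (H, W, C) map with a non-overlapping window x window kernel."""
    H, W, C = Z.shape
    blocks = Z.reshape(H // window, window, W // window, window, C)
    return blocks.mean(axis=(1, 3))

def gating_logits(features, factors, W_g, eps=1e-5):
    """Compute G with shape (H_s, W_s, K) from K hierarchical feature maps."""
    f_K = max(factors)
    pooled = []
    for Z, f in zip(features, factors):
        P = avg_pool(Z, f_K // f)                 # bring Z_i to the coarsest grid
        # Stand-in normalization (assumption): zero mean, unit variance per level.
        P = (P - P.mean()) / (P.std() + eps)
        pooled.append(P)
    concat = np.concatenate(pooled, axis=-1)      # (H_s, W_s, K * n_z)
    return concat @ W_g                           # (H_s, W_s, K)

# Toy setup: 64x64 image, factors {8, 16}, n_z = 4 (all assumed for illustration).
rng = np.random.default_rng(0)
factors = [8, 16]
features = [rng.normal(size=(8, 8, 4)), rng.normal(size=(4, 4, 4))]
W_g = rng.normal(size=(2 * 4, 2))                 # (K * n_z, K)
G = gating_logits(features, factors, W_g)
```

Each position of `G` holds one logit per granularity, which is the vector \(g_{i,j}\) used in the next step.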

Next, the logits \(g_{i,j} \in \mathbb{R}^K\) of each region \((i,j)\) are used to determine the granularity of that region, that is, how many blocks it is split into: