2020-ICLR-BlockSwap: Fisher-guided Block Substitution for Network Compression on a Budget – Paper Reading Notes
Source: ChenBong, 博客园 (cnblogs)
- Institute: University of Edinburgh
- Authors: Jack Turner, Michael O'Boyle
- GitHub: https://github.com/BayesWatch/pytorch-blockswap (20+)
- Citations: 1
Introduction
- Backbone network (composed of standard blocks)
- Cheap-blocks pool ==> (swap the backbone's blocks) candidate block-swap network space
- Sample under the parameter constraint ==> compute Fisher score ==> rank networks
- Distillation (T: backbone network, S: block-swap network)
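The pipeline above can be sketched as a rejection-sampling loop: draw random block assignments, keep those within the parameter budget, and score the survivors. Everything here (block choices, widths, budget) is illustrative, not taken from the released code.

```python
import random

# Parameter counts for two block types (see the Substitute Blocks
# section); k is the conv kernel size, N the channel width.
def standard_params(N, k=3):              # two k x k convs
    return 2 * N**2 * k**2

def grouped_pointwise_params(N, g, k=3):  # G(g): grouped k x k + 1x1
    return 2 * (N**2 * k**2 // g + N**2)

# Hypothetical pool of cheap replacements for each block position.
BLOCK_CHOICES = [
    ("Standard", standard_params),
    ("G(2)", lambda N: grouped_pointwise_params(N, g=2)),
    ("G(4)", lambda N: grouped_pointwise_params(N, g=4)),
]

def sample_candidates(widths, budget, n_samples, seed=0):
    """Rejection-sample block assignments whose total parameter
    count fits within `budget`."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n_samples:
        assignment = [rng.choice(BLOCK_CHOICES) for _ in widths]
        total = sum(fn(N) for (_, fn), N in zip(assignment, widths))
        if total <= budget:
            kept.append(([name for name, _ in assignment], total))
    return kept

# Three block positions with widths 16, 32, 64 channels.
cands = sample_candidates(widths=[16, 32, 64], budget=100_000, n_samples=5)
```

Each surviving candidate would then be scored with a single-minibatch Fisher score, and the top-ranked network trained from scratch with distillation.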
Contribution
- Block-wise substitution shrinks the search space relative to NAS (a bottom-up method), so the search is faster; relative to methods that change a network's depth/width, such as pruning (top-down methods), it searches a higher-dimensional space (pruning can only change the number of filters per layer, while block swap can also change the type of each layer)
- A fast candidate-network evaluation method based on Fisher information
Method
Fisher Information
- The Taylor-expansion method for estimating filter importance is equivalent to computing the Fisher information
- \(\Delta_{c}=\frac{1}{2 N} \sum_{n}^{N}\left(\sum_{i}^{W} \sum_{j}^{H} a_{n i j} g_{n i j}\right)^{2}\)
- W×H is the feature-map size; the activation-gradient product \(\sum_{ij} a_{ij} g_{ij}\) measures the importance \(\Delta_{c}\) of one channel (a single filter's output)
- \(\Delta_{b}=\sum_{c}^{C} \Delta_{c}\)
- C is the total number of channels in a block; the importance of a block is \(\Delta_{b}\)
- \(\sum_B \Delta_{b}\)
- B is the number of blocks in a block-swap network; the importance of a block-swap network is \(\sum_B \Delta_{b}\)
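A minimal numpy sketch of the Fisher score above, assuming we already have a block's activations `a` and the gradients `g` of the loss with respect to those activations (both of shape batch N × channels C × W × H; all names and data here are illustrative):

```python
import numpy as np

def channel_fisher(a, g):
    """Delta_c per channel: (1/2N) * sum_n (sum_ij a_nij * g_nij)^2."""
    N = a.shape[0]
    inner = (a * g).sum(axis=(2, 3))           # sum over W, H -> (N, C)
    return (inner ** 2).sum(axis=0) / (2 * N)  # sum over n    -> (C,)

def block_fisher(a, g):
    """Delta_b: Delta_c summed over the block's C channels."""
    return channel_fisher(a, g).sum()

# A candidate network's score is the sum of Delta_b over its B blocks;
# random arrays stand in for real activations and gradients here.
rng = np.random.default_rng(0)
a = rng.standard_normal((8, 4, 5, 5))   # batch 8, 4 channels, 5x5 maps
g = rng.standard_normal((8, 4, 5, 5))
score = block_fisher(a, g)
```

The score is nonnegative by construction (a sum of squares), so it ranks candidates without sign ambiguity.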
Substitute Blocks
Standard Block
- Params: \(2N^2k^2\)
Grouped+Pointwise Block – G(g)
- Params: \(2(N^2k^2/g+N^2)\)
Bottleneck Block – B(b)
- Params: \((N/b)^2k^2+2N^2/b\)
Bottleneck Grouped+Pointwise Block – BG(b, g)
- Params: \((N/b)^2k^2/g+2N^2/b\)
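These counts can be sanity-checked in a few lines of Python (k is the kernel size, N the block's channel width; batch-norm and bias parameters are ignored, and the concrete numbers below are just worked examples):

```python
def standard(N, k=3):
    return 2 * N * N * k * k                 # two k x k convs, N -> N

def grouped_pointwise(N, g, k=3):
    return 2 * (N * N * k * k // g + N * N)  # grouped k x k + 1x1, twice

def bottleneck(N, b, k=3):
    m = N // b                               # reduced width
    return m * m * k * k + 2 * N * N // b    # 1x1 down, k x k, 1x1 up

def bottleneck_grouped(N, b, g, k=3):
    m = N // b
    return m * m * k * k // g + 2 * N * N // b  # k x k conv also grouped

for name, p in [("Standard", standard(64)),
                ("G(4)", grouped_pointwise(64, g=4)),
                ("B(2)", bottleneck(64, b=2)),
                ("BG(2, 4)", bottleneck_grouped(64, b=2, g=4))]:
    print(f"{name:9s} {p:6d} params")
```

For N = 64 this gives 73728, 26624, 13312, and 6400 parameters respectively, so for these settings each substitution is strictly cheaper than the last.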
Distillation
\(\mathcal{L}_{A T}=\mathcal{L}_{C E}+\beta \sum_{i=1}^{L}\left\|\frac{\mathbf{f}\left(A_{i}^{t}\right)}{\left\|\mathbf{f}\left(A_{i}^{t}\right)\right\|_{2}}-\frac{\mathbf{f}\left(A_{i}^{s}\right)}{\left\|\mathbf{f}\left(A_{i}^{s}\right)\right\|_{2}}\right\|_{2} \qquad (1)\)
\(\mathbf{f}\left(A_{i}\right)=\left(1 / N_{A_{i}}\right) \sum_{j=1}^{N_{A_{i}}} \mathbf{a}_{i j}^{2}\), where \(i=1,2,...,L\) indexes the layers and \(N_{A_i}\) is the number of channels at layer \(i\).
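Eq. (1) is Zagoruyko & Komodakis' attention-transfer loss. A numpy sketch of one layer's term (array shapes are illustrative): f(A) averages the squared activations over channels, so teacher and student maps stay comparable even when their channel counts differ.

```python
import numpy as np

def attention_map(A):
    """f(A): mean of squared activations over channels. (C, W, H) -> (W, H)."""
    return (A ** 2).mean(axis=0)

def at_loss_term(A_t, A_s):
    """|| f(A_t)/||f(A_t)||_2 - f(A_s)/||f(A_s)||_2 ||_2 for one layer."""
    ft = attention_map(A_t).ravel()
    fs = attention_map(A_s).ravel()
    return np.linalg.norm(ft / np.linalg.norm(ft) - fs / np.linalg.norm(fs))

# The full loss adds beta * (sum of these terms) to the cross-entropy.
rng = np.random.default_rng(0)
teacher_acts = [rng.standard_normal((32, 8, 8)) for _ in range(3)]
student_acts = [rng.standard_normal((16, 8, 8)) for _ in range(3)]  # fewer channels
at_term = sum(at_loss_term(t, s) for t, s in zip(teacher_acts, student_acts))
```

Because the channel axis is collapsed before normalization, the student can be much thinner than the teacher and the loss remains well-defined.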
Experiments
CIFAR-10
Setup
- momentum: 0.9
- lr: 0.1 initial, cosine schedule
- minibatch size: 128
- weight decay: 5e-4
- β: 1000
Teacher Network:
- 3 × WRN-40-2 (depth 40, width multiplier 2, 18 blocks, 2.2M params)
Student Network:
Params constraint: 200K / 400K / 600K / 800K
- WRN-16-2 / WRN-40-1 / WRN-16-1
- WRN-40-2 + mixed swap
- WRN-40-2 + Single swap (MBConv6 / DARTS / DenseNet)
- WRN-40-2 + SNIP pruning
- WRN-40-2 + \(\ell_1\) pruning
ImageNet
Setup
- momentum: 0.9
- lr: 0.1 initial, step decay at epochs 30, 60, 90
- minibatch size: 256
- weight decay: 1e-4
- β: 750
Teacher Network:
- 1 × ResNet34 (16 blocks, 21.8M params)
Student Network:
Params constraint: 8M / 3M
- ResNet18 / ResNet18-0.5 (the channel width in the last 3 sections has been halved)
- ResNet34 + mixed swap
- ResNet34 + Single swap (G(4) / G(N))
Ablation Study
Mixed block vs. single block
For any single-swap network there always exists a better mixed-swap architecture
One minibatch vs. N minibatches, and ranking correlation
Correlation between final error and the following metrics, measured over different numbers of minibatches:
- acc
- weight l2 norm
- grad l1 norm
- fisher score
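Comparing these proxies amounts to asking whose ranking of the candidates best matches the final-error ranking, for which Spearman rank correlation is the natural measure. A self-contained sketch (the scores and errors below are made-up numbers; a good score proxy is strongly *negatively* correlated with final error):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return np.corrcoef(rx, ry)[0, 1]

# Hypothetical Fisher scores and final test errors for five candidates.
fisher_scores = np.array([3.1, 1.2, 4.5, 2.0, 0.7])
final_errors  = np.array([4.6, 5.2, 4.2, 4.9, 5.6])
rho = spearman(fisher_scores, final_errors)  # -1.0: perfect inverse ranking
```

A rank correlation of -1 means the proxy orders candidates exactly as the final errors do, so the single cheapest evaluation already identifies the best network.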
Number of Samples
BlockSwap finds networks with final test errors of 4.85%, 4.54%, and 4.21% after 10, 100, and 1000 samples respectively.
1000 samples were empirically found to be sufficient.
Conclusion
Summary
To Read