cs294-ai-sys2022 lectures11 reading
- 1. The State of Sparsity in Deep Neural Networks (2019, optional)
- Contributions
- We perform a comprehensive evaluation of variational dropout, l0 regularization (Louizos et al., 2017b), and magnitude pruning (Zhu & Gupta, 2017) on Transformer trained on WMT 2014 English-to-German and ResNet-50 trained on ImageNet.
- While variational dropout and l0 regularization achieve state-of-the-art results on small datasets, we show that they perform inconsistently for large-scale tasks and that simple magnitude pruning can achieve comparable or better results for a reduced computational budget.
- we achieve a new state-of-the-art sparsity-accuracy trade-off for ResNet-50 using only magnitude pruning
- We repeat the lottery ticket (Frankle & Carbin, 2018) and scratch (Liu et al., 2018) experiments on Transformer and ResNet-50 across a full range of sparsity levels. We show that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same testset performance as a model trained with pruning as part of the optimization process.
- Sparsity in Neural Networks
- Some of the earliest techniques for sparsifying neural networks make use of second-order approximation of the loss surface to avoid damaging model quality (LeCun et al., 1989; Hassibi & Stork, 1992). More recent work has achieved comparable compression levels with more computationally efficient first-order loss approximations, and further refinements have related this work to efficient empirical estimates of the Fisher information of the model parameters (Molchanov et al., 2016; Theis et al., 2018).
-
- Evaluating Sparsification Techniques at Scale
- Magnitude Pruning
- Magnitude-based weight pruning schemes use the magnitude of each weight as a proxy for its importance to model quality, and remove the least important weights according to some sparsification schedule over the course of training
- we use the approach introduced in Zhu & Gupta (2017), which is conveniently available in the TensorFlow model pruning library. This technique allows masked weights to reactivate during training based on gradient updates, and makes use of a gradual sparsification schedule with sorting-based weight thresholding to achieve a user-specified level of sparsification. These features enable high compression ratios at a reduced computational cost relative to the iterative pruning and re-training approach.
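The cubic sparsity ramp and sorting-based threshold described above can be sketched in a few lines of numpy. This is an illustrative sketch, not the TensorFlow pruning library's API; function names and default steps are mine.

```python
import numpy as np

def sparsity_schedule(step, s_initial=0.0, s_final=0.9,
                      begin_step=0, end_step=10000):
    """Cubic ramp from Zhu & Gupta (2017):
    s_t = s_f + (s_i - s_f) * (1 - (t - t_0) / (t_end - t_0))**3."""
    t = np.clip(step, begin_step, end_step)
    frac = (t - begin_step) / (end_step - begin_step)
    return s_final + (s_initial - s_final) * (1.0 - frac) ** 3

def magnitude_mask(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights via a
    sorting-based threshold. Because the dense weights keep receiving
    gradient updates, a masked weight can later cross the threshold
    and reactivate, as the notes above describe."""
    k = int(round(sparsity * weights.size))
    if k == 0:
        return np.ones_like(weights, dtype=bool)
    threshold = np.sort(np.abs(weights).ravel())[k - 1]
    return np.abs(weights) > threshold
```

Sparsity starts at `s_initial`, ramps quickly at first, and flattens out as it approaches `s_final`, so most pruning happens while the network can still recover.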
- Variational Dropout
- It was later demonstrated that by learning a model with variational dropout and per-parameter dropout rates, weights with high dropout rates can be removed post-training to produce highly sparse solutions
- l0 Regularization
- l0 regularization explicitly penalizes the number of nonzero weights in the model to induce sparsity. However, the l0-norm is both non-convex and non-differentiable. To address the non-differentiability of the l0-norm, Louizos et al. (2017b) propose a reparameterization of the neural network weights as the product of a weight and a stochastic gate variable sampled from a hard-concrete distribution. The parameters of the hard-concrete distribution can be optimized directly using the reparameterization trick, and the expected l0-norm can be computed using the value of the cumulative distribution function of the random gate variable evaluated at zero.
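The hard-concrete reparameterization can be sketched in numpy as follows (the stretch constants follow values reported by Louizos et al.; the helper names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stretching the concrete distribution to [GAMMA, ZETA] with
# GAMMA < 0 < 1 < ZETA, then clipping to [0, 1], puts probability
# mass exactly at 0 (weight pruned) and 1 (weight kept).
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0

def sample_gate(log_alpha, rng):
    """Sample a stochastic gate z in [0, 1] with the reparameterization
    trick: noise u is independent of log_alpha, so gradients flow to it."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=np.shape(log_alpha))
    s = sigmoid((np.log(u) - np.log(1.0 - u) + log_alpha) / BETA)
    return np.clip(s * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)

def expected_l0(log_alpha):
    """P(z != 0): one minus the CDF of the stretched concrete at zero.
    This differentiable quantity is the surrogate for the l0 penalty."""
    return sigmoid(log_alpha - BETA * np.log(-GAMMA / ZETA))
```

Driving `log_alpha` down makes the gate (and hence the weight) exactly zero with high probability, while `expected_l0` supplies the gradient signal that penalizes nonzero weights.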
- Sparse Neural Machine Translation
- While l0 regularization and variational dropout produce the top performing models in the low-to-mid sparsity range, magnitude pruning achieves the best results for highly sparse models.
- What is particularly notable about the performance of magnitude pruning is that our experiments uniformly remove the same fraction of weights for each layer. This is in stark contrast to variational dropout and l0 regularization (their per-layer sparsity fractions differ, which in principle should beat a uniform per-layer allocation), where the distribution of sparsity across the layers is learned through the training process. Previous work has shown that a nonuniform sparsity among different layers is key to achieving high compression rates (He et al., 2018), and variational dropout and l0 regularization should theoretically be able to leverage this feature to learn better distributions of weights for a given global sparsity.
- Figure 2 shows the distribution of sparsity across the different layer types in the Transformer for the top performing model at 90% global sparsity for each technique. Both l0 regularization and variational dropout learn to keep more parameters in the embedding, FFN layers, and the output transforms for the multi-head attention modules and induce more sparsity in the transforms for the query and value inputs to the attention modules. Despite this advantage, l0 regularization and variational dropout did not significantly outperform magnitude pruning, even yielding inferior results at high sparsity levels.
- It is also important to note that these results maintain a constant number of training steps across all techniques and that the Transformer variant with magnitude pruning trains 1.24x and 1.65x faster than l0 regularization and variational dropout respectively.
- Sparse Image Classification
- While achieving a good accuracy-sparsity trade-off, variational dropout consistently ranked behind l0 regularization on Transformer, and was bested by magnitude pruning for sparsity levels of 80% and up. However, on ResNet-50 we observe that variational dropout consistently produces models on-par or better than magnitude pruning, and that l0 regularization is not able to produce sparse models at all. Variational dropout achieved particularly notable results in the high sparsity range, maintaining a top-1 accuracy over 70% with less than 4% of the parameters of a standard ResNet-50.
- While variational dropout is able to learn to distribute sparsity non-uniformly across the layers, it comes at a significant increase in resource requirements. For ResNet-50 trained with variational dropout we observed a greater than 2x increase in memory consumption.
- Pushing the Limits of Magnitude Pruning
- To understand the limits of the magnitude pruning heuristic (i.e., relaxing the assumption that every layer is pruned to the same sparsity), we modify our ResNet-50 training setup to leave the first convolutional layer fully dense, and only prune the final fully-connected layer to 80% sparsity. This heuristic is reasonable for ResNet-50, as the first layer makes up a small fraction of the total parameters in the model and the final layer makes up only 0.03% of the total FLOPs. While tuning the magnitude pruning ResNet-50 models, we observed that the best models always started and ended pruning during the third learning rate phase, before the second learning rate drop. To take advantage of this, we increase the number of training steps by 1.5x by extending this learning rate region. Results for ResNet-50 trained with this scheme are plotted in figure 5.
- With these modifications, magnitude pruning outperforms variational dropout at all but the highest sparsity levels while still using fewer resources. However, variational dropout's performance in the high sparsity range is particularly notable. With very low amounts of non-zero weights, we find it likely that the model's performance on the test set is closely tied to the precise allocation of weights across the different layers, and that variational dropout's ability to learn this distribution enables it to better maintain accuracy at high sparsity levels. This result indicates that efficient sparsification techniques that are able to learn the distribution of sparsity across layers are a promising direction for future work.
- Sparsification as Architecture Search (the retained sparse structure is effectively a form of NAS)
- While sparsity is traditionally thought of as a model compression technique, two independent studies have recently suggested that the value of sparsification in neural networks is misunderstood, and that once a sparse topology is learned it can be trained from scratch to the full performance achieved when sparsification was performed jointly with optimization (i.e., it yields a network architecture trainable from scratch; cf. "winning lottery tickets").
- Across all of our experiments, we observed that training from scratch using a learned sparse architecture is not able to match the performance of the same model trained with sparsification as part of the optimization process.
- Across both models, we observed that doubling the number of training steps did improve the quality of the results for the scratch experiments, but was not sufficient to match the test set performance of the magnitude pruning baseline. As sparsity increased, we observed that the deviation between the models trained with magnitude pruning and those trained from scratch increased. For both models, we did not observe a benefit from using the augmented weight initialization for the scratch experiments.
2. WHAT IS THE STATE OF NEURAL NETWORK PRUNING? (2019, optional)
- Motivation:
- After aggregating results across 81 papers and pruning hundreds of models in controlled conditions, our clearest finding is that the community suffers from a lack of standardized benchmarks and metrics.
- Contributions:
- A meta-analysis of the neural network pruning literature based on comprehensively aggregating reported results from 81 papers.
- A catalog of problems in the literature and best practices for avoiding them. These insights derive from analyzing existing work and pruning hundreds of models.
- ShrinkBench, an open-source library for evaluating neural network pruning methods
- OVERVIEW OF PRUNING
- Differences Between Pruning Methods
- Structured vs. unstructured
- Scoring. It is common to score parameters based on their absolute values, trained importance coefficients, or contributions to network activations or gradients. Some pruning methods compare scores locally, pruning a fraction of the parameters with the lowest scores within each structural subcomponent of the network (e.g., layers) (Han et al., 2015). Others consider scores globally, comparing scores to one another irrespective of the part of the network in which the parameter resides
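The local-vs-global scoring distinction is easy to see in a toy numpy sketch (helper names are mine; real methods may score by importance coefficients or gradients rather than |w|, as noted above):

```python
import numpy as np

def local_masks(layers, fraction):
    """Local scoring: prune the lowest-|w| fraction within each layer
    separately, so every layer ends up at the same sparsity."""
    masks = []
    for w in layers:
        k = int(fraction * w.size)
        thresh = np.sort(np.abs(w).ravel())[k] if k > 0 else -np.inf
        masks.append(np.abs(w) >= thresh)
    return masks

def global_masks(layers, fraction):
    """Global scoring: pool scores across all layers and apply one
    threshold, so sparsity can concentrate in low-scoring layers."""
    all_scores = np.concatenate([np.abs(w).ravel() for w in layers])
    k = int(fraction * all_scores.size)
    thresh = np.sort(all_scores)[k] if k > 0 else -np.inf
    return [np.abs(w) >= thresh for w in layers]
```

In the test below, a layer whose weights are uniformly smaller than another's is untouched by local pruning but wiped out entirely by global pruning, which is exactly the trade-off the two conventions embody.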
- Scheduling. Pruning methods differ in the amount of the network to prune at each step. Some methods prune all desired weights at once in a single step (Liu et al., 2019). Others prune a fixed fraction of the network iteratively over several steps (Han et al., 2015) or vary the rate of pruning according to a more complex function (Gale et al., 2019).
- Fine-tuning. For methods that involve fine-tuning, it is most common to continue to train the network using the trained weights from before pruning. Alternative proposals include rewinding the network to an earlier state (Frankle et al., 2019) and reinitializing the network entirely (Liu et al., 2019).
3. Linear Mode Connectivity and the Lottery Ticket Hypothesis (2020, required)
- Motivation
- We study whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise (e.g., random data order and augmentation)
- Introduction
- Instability analysis
- The goal of instability analysis is to determine whether the outcome of optimizing a particular neural network is stable to SGD noise. First, we create a network N with random initialization W0. We then train two copies of N with different samples of SGD noise (i.e., different random data orders and augmentations). Finally, we compare the resulting networks to measure the effect of these different samples of SGD noise on the outcome of optimization. If the networks are sufficiently similar according to a criterion, we determine N to be stable to SGD noise. We also study this behavior starting from the state of N at step k of training (Figure 1 right). Doing so allows us to determine when the outcome of optimization becomes stable to SGD noise.
- There are many possible ways in which to compare the networks that result from instability analysis (Appendix G). We use the behavior of the optimization landscape along the line between these networks (blue in Figure 1). Does error remain flat or even decrease (meaning the networks are in the same, linearly connected minimum), or is there a barrier of increased error? We define the linear interpolation instability of N to SGD noise as the maximum increase in error along this path (red). We consider N stable to SGD noise if error does not increase along this path, i.e., instability is nearly equal to 0. This means N will find the same, linearly connected minimum regardless of the sample of SGD noise.
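The definition above can be made concrete with a short sketch: interpolate the weights, evaluate error at each point, and report the rise over the endpoint average. Here `error_fn` stands in for the network's test error; the double-well loss in the test is only a toy illustration of a barrier between two minima.

```python
import numpy as np

def linear_interpolation_instability(w1, w2, error_fn, num_points=31):
    """Instability to SGD noise: the maximum increase in error along the
    straight line between two solutions, relative to the mean endpoint
    error. Near-zero instability means both runs landed in the same
    linearly connected minimum."""
    alphas = np.linspace(0.0, 1.0, num_points)
    errors = [error_fn((1.0 - a) * w1 + a * w2) for a in alphas]
    return max(0.0, max(errors) - 0.5 * (errors[0] + errors[-1]))
```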
- we assess a linear form of mode connectivity, a phenomenon where the minima found by two networks are connected by a path of nonincreasing error
- The lottery ticket hypothesis
- On a range of image classification benchmarks including standard networks on ImageNet, we observe that networks become stable to SGD noise early in training.
- We use instability analysis to distinguish successes and failures of IMP (the method behind extant lottery ticket results). Namely, sparse IMP subnetworks are matching only when they are stable to SGD noise.
- We generalize IMP to find subnetworks early in training rather than at initialization. We show that IMP subnetworks become stable and matching when set to their weights from early in training, making it possible to extend the lottery ticket observations to larger scales
- Instability Analysis of Unpruned Networks
- we train a network for k steps, make two copies, train them to completion with different samples of SGD noise, and linearly interpolate.
- We find that, although only LeNet is stable to SGD noise at initialization, every network becomes stable early in training, meaning the outcome of optimization from that point forward is determined to a linearly connected minimum. For example, they do so at iteration 2000 for ResNet-20 and iteration 1000 for VGG-16; in other words, after 3% and 1.5% of training, SGD noise does not affect the final linearly connected minimum. ResNet-50 and Inception-v3 become stable later in training.
- Instability Analysis of Lottery Tickets
- One of our contributions is to generalize IMP to rewind to any step k. Frankle & Carbin refer to subnetworks that match the accuracy of the full network as winning tickets because they have “won the initialization lottery” with weights that make attaining this accuracy possible. When we rewind to iteration k > 0, subnetworks are no longer randomly initialized, so the term winning ticket is no longer appropriate. Instead, we refer to such subnetworks simply as matching.
- In summary, at these extreme sparsities, IMP subnetworks are matching when they are stable.
4. Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks
- OVERVIEW OF SPARSITY IN DEEP LEARNING
- Performance and model storage
5. A Survey of Quantization Methods for Efficient Neural Network Inference (2021, optional)
- GENERAL HISTORY OF QUANTIZATION
- Thus, it is possible to have high error/distance between a quantized model and the original non-quantized model, while still attaining very good generalization performance. This added degree of freedom was not present in much of the classical research, which mostly focused on finding compression methods that would not change the signal too much, or with numerical methods in which there was strong control on the difference between the “exact” versus the “discretized” computation. This observation has been the main driver for researching novel techniques for NN quantization.
- Finally,the layered structure of Neural Net models offers an additional dimension to explore. Different layers in a Neural Net have different impact on the loss function, and this motivates a mixed-precision approach to quantization.
- BASIC CONCEPTS OF QUANTIZATION
- Uniform Quantization
- Symmetric and Asymmetric Quantization
- A straightforward choice is to use the min/max of the signal for the clipping range, i.e., alpha = r_min and beta = r_max. This is an asymmetric quantization scheme, since the clipping range is not necessarily symmetric with respect to the origin, i.e., -alpha != beta, as illustrated in Figure 2 (Right). It is also possible to use a symmetric quantization scheme by choosing a symmetric clipping range of -alpha = beta.
- Asymmetric quantization often results in a tighter clipping range as compared to symmetric quantization. This is especially important when the target weights or activations are imbalanced, e.g., the activation after ReLU, which always has non-negative values (post-ReLU activations can thus be quantized to uint8). Using symmetric quantization, however, simplifies the quantization function in Eq. 2 by replacing the zero point with Z = 0.
- One approach to address this is to use percentile instead of min/max of the signal [172]. That is to say, instead of the largest/smallest value, the i-th largest/smallest values are used as beta/alpha. Another approach is to select alpha and beta to minimize KL divergence (i.e., information loss) between the real values and the quantized values [176]. We refer the interested readers to [255] where the different calibration methods are evaluated on various models
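A numpy sketch of the two clipping schemes and of percentile calibration (function names are mine; the affine map q = round(r/S) + Z follows the survey's Eq. 2):

```python
import numpy as np

def quantize_dequantize(x, alpha, beta, num_bits=8, symmetric=False):
    """Clip to [alpha, beta], quantize uniformly, and map back to reals.
    Symmetric mode forces -alpha = beta and uses zero point Z = 0."""
    if symmetric:
        beta = max(abs(alpha), abs(beta))
        qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for int8
        scale = beta / qmax
        q = np.clip(np.round(x / scale), -qmax, qmax)
        return q * scale                           # Z = 0, nothing to subtract
    qmin, qmax = 0, 2 ** num_bits - 1              # e.g. [0, 255] for uint8
    scale = (beta - alpha) / (qmax - qmin)
    zero_point = round(-alpha / scale)
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

def percentile_range(x, p=99.9):
    """Calibrate [alpha, beta] from the p-th smallest/largest values,
    so a single outlier cannot blow up the clipping range."""
    return np.percentile(x, 100.0 - p), np.percentile(x, p)
```

The round-trip error is bounded by one quantization step, which is why a tighter (e.g., asymmetric or percentile-calibrated) range directly improves precision.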
- Range Calibration Algorithms: Static vs Dynamic Quantization
- Dynamic quantization dynamically computes the clipping range of each activation and often achieves the highest accuracy. However, calculating the range of a signal dynamically is very expensive, and as such, practitioners most often use static quantization where the clipping range is fixed for all inputs.
- Quantization Granularity
- Layerwise Quantizations
- Groupwise Quantization. One could group multiple different channels inside a layer to calculate the clipping range (of either activations or convolution kernels). This could be helpful for cases where the distribution of the parameters across a single convolution/activation varies a lot. For instance, this approach was found useful in Q-BERT [219] for quantizing Transformer [243] models that consist of fully-connected attention layers.
- Channelwise quantization is currently the standard method used for quantizing convolutional kernels. It enables the practitioner to adjust the clipping range for each individual kernel with negligible overhead.
- sub-channelwise quantization may result in significant overhead and is not currently the standard choice
- Generally, non-uniform quantization enables us to better capture the signal information, by assigning bits and discretizing the range of parameters non-uniformly. However, non-uniform quantization schemes are typically difficult to deploy efficiently on general computation hardware, e.g., GPU and CPU. As such, uniform quantization is currently the de-facto method due to its simplicity and its efficient mapping to hardware.
- Fine-tuning Methods
- Quantization-Aware Training
- the usual forward and backward passes are performed on the quantized model in floating point, but the model parameters are quantized after each gradient update (similar to projected gradient descent).
- An important subtlety in backpropagation is how the non-differentiable quantization operator (Eq. 2) is treated. Without any approximation, the gradient of this operator is zero almost everywhere, since the rounding operation in Eq. 2 is a piece-wise flat operator. A popular approach to address this is to approximate the gradient of this operator by the so-called Straight Through Estimator (STE) [13]. STE essentially ignores the rounding operation and approximates it with an identity function, as illustrated in Figure 5.
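A minimal sketch of fake quantization with an STE backward pass (pure numpy, no autograd; names are illustrative):

```python
import numpy as np

def fake_quantize(x, scale=0.1):
    """Forward pass: snap x onto the quantization grid. This map is
    piece-wise flat, so its true derivative is zero almost everywhere."""
    return np.round(x / scale) * scale

def ste_backward(upstream_grad):
    """Straight-Through Estimator: pretend the rounding was the identity
    and pass the upstream gradient through unchanged."""
    return upstream_grad
```

In autograd frameworks the same trick is usually the one-liner `y = x + stop_gradient(round(x / s) * s - x)`: the forward value is quantized, while the backward pass sees only the identity term.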
- In addition to adjusting model parameters, some prior work found it effective to learn quantization parameters during QAT as well. PACT [36] learns the clipping ranges of activations under uniform quantization
- QAT has been shown to work despite the coarse approximation of STE. However, the main disadvantage of QAT is the computational cost of re-training the NN model. This re-training may need to be performed for several hundred epochs to recover accuracy, especially for low-bit precision quantization. If a quantized model is going to be deployed for an extended period, and if efficiency and accuracy are especially important, then this investment in re-training is likely to be worth it. However, this is not always the case, as some models have a relatively short lifetime.
- Post-Training Quantization
- multiple approaches have been proposed to mitigate the accuracy degradation of PTQ. For example, [11, 63] observe inherent bias in the mean and variance of the weight values following their quantization and propose bias correction methods; and [174, 182] show that equalizing the weight ranges (and implicitly activation ranges) between different layers or channels can reduce quantization errors. ACIQ [11] analytically computes the optimal clipping range and the channel-wise bitwidth setting for PTQ. Although ACIQ can achieve low accuracy degradation, the channel-wise activation quantization used in ACIQ is hard to efficiently deploy on hardware. In order to address this, the OMSE method [40] removes channel-wise quantization on activation and proposes to conduct PTQ by optimizing the L2 distance between the quantized tensor and the corresponding floating point tensor. Furthermore, to better alleviate the adverse impact of outliers on PTQ, an outlier channel splitting (OCS) method is proposed in [281] which duplicates and halves the channels containing outlier values. Another notable work is AdaRound [181] which shows that the naive round-to-nearest method for quantization can counter-intuitively result in sub-optimal solutions, and it proposes an adaptive rounding method.
- Zero-shot Quantization
- Zero Shot (aka data free) quantization performs the entire quantization without any access to the training/validation data. This is particularly important for Machine Learning as a Service (MLaaS) providers who want to accelerate the deployment of a customer’s workload, without the need to access their dataset. Moreover, this is important for cases where security or privacy concerns may limit access to the training data.
- Quantization-Aware Training
- ADVANCED CONCEPTS: QUANTIZATION BELOW 8 BITS
- Simulated and Integer-only Quantization
- In simulated quantization, the quantized model parameters are stored in low-precision, but the operations (e.g. matrix multiplications and convolutions) are carried out with floating point arithmetic
- However, both methods are limited to ReLU activation. The recent work of [132] addresses this limitation by approximating GELU [94], Softmax, and Layer Normalization [6] with integer arithmetic and further extends integer-only quantization to Transformer [243] architectures.
- Mixed-Precision Quantization
- In this approach, each layer is quantized with different bit precision, as illustrated in Figure 8. One challenge with this approach is that the search space for choosing this bit setting is exponential in the number of layers
- The recent work of [246] proposed a reinforcement learning (RL) based method to automatically determine the quantization policy, and the authors used a hardware simulator to incorporate the hardware accelerator's feedback into the RL agent's reward. The paper [254] formulated the mixed-precision configuration search problem as a Neural Architecture Search (NAS) problem and used the Differentiable NAS (DNAS) method to efficiently explore the search space. One disadvantage of these exploration-based methods [246, 254] is that they often require large computational resources, and their performance is typically sensitive to hyperparameters and even initialization.
- Another class of mixed-precision methods uses periodic function regularization to train mixed-precision models by automatically distinguishing different layers and their varying importance with respect to accuracy while learning their respective bitwidths [184].
- Different than these exploration and regularization-based approaches, HAWQ [51] introduces an automatic way to find the mixed-precision settings based on second-order sensitivity of the model. It was theoretically shown that the trace of the second-order operator (i.e., the Hessian) can be used to measure the sensitivity of a layer to quantization [50], similar to results for pruning in the seminal work of Optimal Brain Damage [139]. In HAWQv2, this method was extended to mixed-precision activation quantization [50], and was shown to be more than 100x faster than RL based mixed-precision methods [246]. Recently, in HAWQv3, an integer-only, hardware-aware quantization was introduced [267] that proposed a fast Integer Linear Programming method to find the optimal bit precision for a given application-specific constraint (e.g., model size or latency). This work also addressed the common question about hardware efficiency of mixed-precision quantization by directly deploying them on T4 GPUs, showing up to 50% speed up with mixed-precision (INT4/INT8) quantization as compared to INT8 quantization.
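As a toy illustration of sensitivity-driven bit allocation: start everything at the lowest precision and upgrade the most sensitive layers while a size budget allows. This greedy sketch is not HAWQv3's ILP formulation; the sensitivities merely stand in for Hessian-trace estimates, and all names are hypothetical.

```python
def assign_bits(sensitivities, sizes, budget_bits, choices=(4, 8)):
    """Greedy mixed-precision sketch: sizes[i] is layer i's parameter
    count, sensitivities[i] its (e.g. Hessian-trace) sensitivity, and
    budget_bits the total model-size budget in bits."""
    low, high = min(choices), max(choices)
    bits = [low] * len(sizes)
    used = sum(b * s for b, s in zip(bits, sizes))
    # Upgrade layers in decreasing order of sensitivity while budget allows.
    for i in sorted(range(len(sizes)), key=lambda i: -sensitivities[i]):
        extra = (high - low) * sizes[i]
        if used + extra <= budget_bits:
            bits[i] = high
            used += extra
    return bits
```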
- Hardware Aware Quantization
- Distillation-Assisted Quantization
- Extreme Quantization
- Extreme lowbit precision quantization is a very promising line of research. However, existing methods often incur high accuracy degradation as compared to baseline, unless very extensive tuning and hyperparameter search is performed. But this accuracy degradation may be acceptable for less critical applications.
- QUANTIZATION AND HARDWARE PROCESSORS
- Here, we briefly discuss different hardware platforms in the context of quantization. ARM Cortex-M is a group of 32-bit RISC ARM processor cores that are designed for low-cost and power-efficient embedded devices. For instance, the STM32 family are the microcontrollers based on the ARM Cortex-M cores that are also used for NN inference at the edge. Because some of the ARM Cortex-M cores do not include dedicated floating-point units, the models should first be quantized before deployment. CMSIS-NN [136] is a library from ARM that helps quantizing and deploying NN models onto the ARM Cortex-M cores. Specifically, the library leverages fixed-point quantization [113, 154, 267] with power-of-two scaling factors so that quantization and dequantization processes can be carried out efficiently with bit shifting operations. GAP-8 [64], a RISC-V SoC (System on Chip) for edge inference with a dedicated CNN accelerator, is another example of an edge processor that only supports integer arithmetic. While programmable general-purpose processors are widely adopted due to their flexibility, Google Edge TPU, a purpose-built ASIC chip, is another emerging solution for running inference at the edge. Unlike Cloud TPUs that run in Google data centers with a large amount of computing resources, the Edge TPU is designed for small and low-power devices, and thereby it only supports 8-bit arithmetic. NN models must be quantized using either quantization-aware training or post-training quantization of TensorFlow.
- A good example is provided by the recent NVIDIA Turing GPUs, and in particular T4 GPUs, which include the Turing Tensor Cores. Tensor Cores are specialized execution units designed for efficient low-precision matrix multiplications.
- FUTURE DIRECTIONS FOR RESEARCH IN QUANTIZATION
- Quantization Software. There are several software packages that can be used to deploy INT8 quantized models (e.g., Nvidia’s TensorRT, TVM, etc.). Moreover, support for INT4 quantization was only recently added to TVM.
- Hardware and NN Architecture Co-Design
- Coupled Compression Methods
- Quantized Training: Perhaps the most important use of quantization has been to accelerate NN training with half-precision [41, 72, 79, 175]. This has enabled the use of much faster and more power-efficient reduced-precision logic for training. However, it has been very difficult to push this further down to INT8 precision training, although several interesting works exist in this area.
6. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference (2017, required; Google; important)
- Contributions
- We provide a quantization scheme (section 2.1) that quantizes both weights and activations as 8-bit integers, and just a few parameters (bias vectors) as 32-bit integers.
- We provide a quantized inference framework that is efficiently implementable on integer-arithmetic-only hardware such as the Qualcomm Hexagon (sections 2.2, 2.3), and we describe an efficient, accurate implementation on ARM NEON (Appendix B).
- We provide a quantized training framework (section 3) co-designed with our quantized inference to minimize the loss of accuracy from quantization on real models
- Quantized Inference
- Quantization scheme
- Equation-heavy; see sections 2.1–2.2 of the paper for the full derivations. The core scheme represents real values as r = S(q - Z), with a floating-point scale S and an integer zero point Z.
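The essence of the scheme is the affine map r = S(q - Z) plus an integer-only matmul in which the only real-valued quantity is the rescaling multiplier M = S1*S2/S3 (realized on hardware as a fixed-point multiply and bit shift). A numpy sketch with hand-picked scales (function names are mine):

```python
import numpy as np

def quantize(r, scale, zero_point):
    """r ≈ S (q - Z): map real values onto the uint8 grid."""
    return np.clip(np.round(r / scale) + zero_point, 0, 255).astype(np.int64)

def int_only_matmul(q1, z1, q2, z2, m, z3):
    """Accumulate (q1 - Z1)(q2 - Z2) in wide integers, then rescale by
    M = S1*S2/S3 and shift by the output zero point Z3. Everything except
    the multiply by m is integer arithmetic."""
    acc = (q1 - z1) @ (q2 - z2)                   # integer-only accumulation
    return np.clip(np.round(m * acc) + z3, 0, 255).astype(np.int64)
```

With S1 = S2 = 1e-3, S3 = 1e-4, and all zero points at 128, M = 0.01 and the dequantized integer product matches the floating-point matmul on the small example below.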
- Training with simulated quantization
- A common approach to training quantized networks is to train in floating point and then quantize the resulting weights (sometimes with additional post-quantization training for fine-tuning). We found that this approach works sufficiently well for large models with considerable representational capacity, but leads to significant accuracy drops for small models. Common failure modes for simple post-training quantization include: 1) large differences (more than 100×) in ranges of weights for different output channels (section 2 mandates that all channels of the same layer be quantized to the same resolution, which causes weights in channels with smaller ranges to have much higher relative error) and 2) outlier weight values that make all remaining weights less precise after quantization.
7. FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search (2019, required)
- Introduction
- Intractable design space
- VGG16 contains 16 layers. Assume for each layer of the network, we can choose a different kernel size from {1, 3, 5} and a different filter number from {32, 64, 128, 256, 512}. Even with such simplified design choices and shallow layers, the design space contains (3*5)^16 possible architectures.
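The arithmetic behind that count, as a quick sanity check:

```python
# 3 kernel sizes x 5 filter counts = 15 independent choices per layer,
# compounded over 16 layers.
choices_per_layer = 3 * 5
design_space = choices_per_layer ** 16
print(design_space)  # 6568408355712890625, i.e. about 6.6e18 architectures
```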
- recent works adopt neural architecture search (NAS) to explore the design space automatically. Many previous works [30, 31, 20] use reinforcement learning (RL) to guide the search and a typical flow is illustrated in Figure 2(b). A controller samples architectures from the search space to be trained. To reduce the training cost, sampled architectures are trained on a smaller proxy dataset such as CIFAR-10 or trained for fewer epochs on ImageNet. The performance of the trained networks is then used to train and improve the controller. Previous works [30, 31, 20] have demonstrated the effectiveness of such methods in finding accurate and efficient ConvNet models. However, training each architecture is still time consuming, and it usually takes thousands of architectures to train the controller. As a result, the computational cost of such methods is prohibitively high.
- Nontransferable optimality
- the optimality of ConvNet architectures is conditioned on many factors such as input resolutions and target devices. A smaller input resolution may require a smaller receptive field of the network and therefore shallower layers. On a different device, the same operator can have different latency. In practice, we can only realistically design one ConvNet and use it for all conditions.
- Inconsistent efficiency metrics
- NASNet-A [31] has a similar FLOP count as MobileNetV1 [6], but its complicated and fragmented cell-level structure is not hardware friendly, so the actual latency is slower.
- Contributions
- we propose to use differentiable neural architecture search (DNAS) to discover hardware-aware efficient ConvNets. The flow of our algorithm is illustrated in Figure 1. DNAS allows us to explore a layer-wise search space where we can choose a different block for each layer of the network. Following [21], DNAS represents the search space by a super net whose operators execute stochastically. We relax the problem of finding the optimal architecture to finding a distribution that yields the optimal architecture (the objective shifts: search for a distribution over architectures rather than a single architecture). By using the Gumbel Softmax technique [9], we can directly train the architecture distribution using gradient-based optimization such as SGD. The search process is extremely fast compared with previous reinforcement learning (RL) based methods. The loss used to train the stochastic super net consists of both the cross-entropy loss that leads to better accuracy and the latency loss that penalizes the network’s latency on a target device.
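The Gumbel-Softmax relaxation that makes the architecture distribution trainable by SGD can be sketched in numpy for a single layer with a vector of block logits (names are illustrative):

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Differentiable relaxation of sampling one candidate block per layer
    (Jang et al., 2016): perturb the architecture logits with Gumbel noise
    and apply a temperature-tau softmax. As tau -> 0 the output approaches
    a one-hot block choice; larger tau mixes all candidates."""
    g = -np.log(-np.log(rng.uniform(1e-10, 1.0, size=logits.shape)))
    y = (logits + g) / tau
    y = y - y.max()                 # shift for numerical stability
    e = np.exp(y)
    return e / e.sum()
```

The super net executes each layer as this soft mixture of its candidate blocks, so the block-selection logits receive ordinary gradients from both the cross-entropy and the latency terms of the loss.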
- Intractable design space
- Method
- The Search Space
- Previous works [30, 31, 16, 11, 12] focus on cell level architecture search. Once a cell structure is searched, it is used in all the layers across the network. However, many searched cell structures are very complicated and fragmented and are therefore slow when deployed to mobile CPUs [17, 13]. Besides, at different layers, the same cell structure can have a different impact on the accuracy and latency of the overall network. As shown in [20] and in our experiments, allowing different layers to choose different blocks leads to better accuracy and efficiency.
- In this work, we construct a layer-wise search space with a fixed macro-architecture, and each layer can choose a different block. The macro-architecture is described in Table 1. The macro architecture defines the number of layers and the input/output dimensions of each layer. The first and the last three layers of the network have fixed operators. For the rest of the layers, their block type needs to be searched. The filter numbers for each layer are hand-picked empirically. We use relatively small channel sizes for early layers, since the input resolution at early layers is large, and the computational cost (FLOP count) is quadratic in the input size.
- The block structure is inspired by MobileNetV2 [17] and ShiftNet [23], and is illustrated in Figure 3. It contains a point-wise (1x1) convolution, a K-by-K depthwise convolution where K denotes the kernel size, and another 1x1 convolution. "ReLU" activation functions follow the first 1x1 convolution and the depthwise convolution, but there are no activation functions following the last 1x1 convolution. If the output dimension stays the same as the input dimension, we use a skip connection to add the input to the output. Following [17, 23], we use a hyperparameter, the expansion ratio e, to control the block. It determines how much we expand the output channel size of the first 1x1 convolution compared with its input channel size. Following [20], we also allow choosing a kernel size of 3 or 5 for the depthwise convolution. In addition, we can choose to use group convolution for the first and the last 1x1 convolution to reduce the computation complexity. When we use group convolution, we follow [29] to add a channel shuffle operation to mix the information between channel groups. In our experiments, our layer-wise search space contains 9 candidate blocks, with their configurations listed in Table 2. Note we also have a block called "skip", which directly feeds the input feature map to the output without actual computations. This candidate block essentially allows us to reduce the depth of the network.
- In summary, our overall search space contains 22 layers and each layer can choose from the 9 candidate blocks in Table 2, so it contains 9^22 ≈ 10^21 possible architectures. Finding the optimal layer-wise block assignment in such an enormous search space is a non-trivial task.
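The size of the search space stated above is easy to sanity-check directly:

```python
# 22 searchable layers, 9 candidate blocks per layer: the number of
# distinct layer-wise block assignments is 9^22, on the order of 10^21.
num_blocks, num_layers = 9, 22
space = num_blocks ** num_layers
```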
-
- Latency Aware Loss Function
- The loss balances accuracy and latency: $\mathcal{L}(a, w_a) = \mathrm{CE}(a, w_a) \cdot \alpha \log(\mathrm{LAT}(a))^{\beta}$, where $\mathrm{CE}(a, w_a)$ is the cross-entropy loss of architecture $a$ with weights $w_a$, $\mathrm{LAT}(a)$ is its latency on the target device, and the coefficients $\alpha$ and $\beta$ control the magnitude of the latency term.
- we use a latency lookup table model to estimate the overall latency of a network as the sum of the runtimes of its operators: $\mathrm{LAT}(a) = \sum_{l} \mathrm{LAT}(b_l^{(a)})$
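A minimal sketch of such a lookup-table latency model, with made-up per-block latencies: for the stochastic super net, the expected latency is the probability-weighted sum over each layer's candidate blocks, which keeps the latency term differentiable with respect to the sampling probabilities:

```python
# Hypothetical per-block latency table (ms), benchmarked once per target device.
latency_table = {
    ("layer1", "k3_e1"): 1.2,
    ("layer1", "k5_e6"): 3.4,
    ("layer2", "k3_e1"): 0.8,
    ("layer2", "k5_e6"): 2.1,
}

def expected_latency(block_probs):
    """Expected network latency: sum over layers of the probability-weighted
    latency of each candidate block. Since it is linear in the probabilities,
    the latency term stays differentiable w.r.t. the architecture parameters."""
    total = 0.0
    for (layer, block), p in block_probs.items():
        total += p * latency_table[(layer, block)]
    return total

# Current sampling probabilities for each candidate block (hypothetical).
probs = {
    ("layer1", "k3_e1"): 0.75, ("layer1", "k5_e6"): 0.25,
    ("layer2", "k3_e1"): 0.50, ("layer2", "k5_e6"): 0.50,
}
lat = expected_latency(probs)
```

Benchmarking each operator once and summing from the table is what makes the search hardware-aware without running the full network on the device at every step.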
- The Search Algorithm
- The inner problem of optimizing weight w_a involves training a neural network. For ImageNet classification, training a ConvNet typically takes several days or even weeks. The outer problem has a combinatorially large search space. To reduce the computational cost, the inner problem is replaced by training candidate architectures on an easier proxy dataset. For example, [30, 31] train architectures on the CIFAR-10 dataset, and [20] trains on ImageNet but only for 5 epochs. The learned architectures are then transferred to the target dataset. To avoid exhaustively iterating through the search space, [30, 31, 20] use reinforcement learning to guide the exploration. Despite these improvements, solving problem (1) is still prohibitively expensive
- We adopt a different paradigm of solving problem (1). We first represent the search space by a stochastic super net. The super net has the same macro-architecture as described in Table 1, and each layer contains 9 parallel blocks as described in Table 2. During the inference of the super net, only one candidate block is sampled and executed with the sampling probability of
- $P_{\theta_l}(b_l = b_{l,i}) = \mathrm{softmax}(\theta_{l,i}; \theta_l) = \frac{\exp(\theta_{l,i})}{\sum_{i} \exp(\theta_{l,i})}$, where $\theta_l$ contains the architecture parameters that determine the sampling probability of each candidate block at layer $l$
- Our search process is now equivalent to training the stochastic super net. During the training, we compute ∂L/∂w_a to train each operator's weight in the super net. This is no different from training an ordinary ConvNet. After operators get trained, different operators can have different contributions to the accuracy and the efficiency of the overall network. Therefore, we compute ∂L/∂θ to update the sampling probability P for each operator. This step selects operators with better accuracy and lower latency and suppresses the opposite ones. After the super net training finishes, we can then obtain the optimal architectures by sampling from the architecture distribution P.
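Once the distribution is trained, one way to read off an architecture is to take each layer's most probable block. Note the paper samples architectures from the distribution; the deterministic argmax below is a simplification, and the layer names and logit values are hypothetical:

```python
# Trained architecture logits theta for each layer (hypothetical values).
theta = {
    "layer1": [2.0, 0.1, -1.0],
    "layer2": [-0.5, 1.5, 0.3],
}

def derive_architecture(theta):
    """Pick each layer's most probable candidate block under the softmax
    distribution P, which is simply the block with the largest logit."""
    return {layer: max(range(len(logits)), key=lambda i: logits[i])
            for layer, logits in theta.items()}

arch = derive_architecture(theta)  # block index chosen per layer
```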
- 8. HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Networks(2019 optional)
- Motivation
- ultra low precision quantization could lead to significant degradation in model generalization. A promising method to address this is to perform mixed-precision quantization, where more sensitive layers are kept at higher precision. However, the search space for a mixed-precision quantization is exponential in the number of layers. Recent work has proposed a novel Hessian based framework [7], with the aim of reducing this exponential search space by using second-order information. While promising, this prior work has three major limitations: (i) they only use the top Hessian eigenvalue as a measure of sensitivity and do not consider the rest of the Hessian spectrum; (ii) their approach only provides relative sensitivity of different layers and therefore requires a manual selection of the mixed-precision setting; and (iii) they do not consider mixed-precision activation quantization.
- Work
- Here, we present HAWQ-V2, which addresses these shortcomings. For (i), we perform a theoretical analysis showing that a better sensitivity metric is the average of all of the Hessian eigenvalues, i.e., the Hessian trace divided by the number of parameters.
- For (ii), we develop a Pareto frontier based method for selecting the exact bit precision of different layers without any manual selection. For (iii), we extend the Hessian analysis to mixed-precision activation quantization.
- Methodology
- Quantization is a non-differentiable function and typically can be addressed by using the Straight Through Estimator (STE) [4] to backpropagate the gradients. See Appendix A for details.
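The STE idea can be shown with a scalar sketch (a toy uniform quantizer for illustration, not HAWQ-V2's actual quantization scheme): the forward pass rounds to the grid, while the backward pass pretends quantization is the identity:

```python
def quantize(x, scale=0.5):
    """Forward pass: uniform quantization, rounding x to the nearest
    multiple of `scale`. This step has zero gradient almost everywhere."""
    return round(x / scale) * scale

def ste_backward(upstream_grad):
    """Backward pass with the Straight Through Estimator: treat the
    rounding as the identity, so the upstream gradient w.r.t. the
    quantized value flows to the full-precision value unchanged."""
    return upstream_grad

w = 0.87
w_q = quantize(w)        # snapped to the 0.5-spaced grid
g = ste_backward(0.3)    # gradient passed straight through to w
```

Without the STE, the gradient of `round` would be zero almost everywhere and the full-precision weights would never update during quantization-aware training.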
- Trace Weighted Quantization
- At first, computing the Hessian trace may seem a prohibitive task, as we do not have direct access to the elements of the Hessian matrix. Furthermore, forming the Hessian matrix explicitly is not computationally feasible. However, it is possible to leverage the extensive literature in Randomized Numerical Linear Algebra (RandNLA) [16, 17], which addresses this type of problem. In particular, the seminal works of [2, 3] have proposed randomized algorithms for fast trace estimation, using so-called matrix-free methods which do not require the explicit formation of the Hessian operator
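A matrix-free estimator in this spirit is Hutchinson's method, E[z^T H z] = Tr(H) for Rademacher vectors z, which needs only Hessian-vector products (obtainable in practice via double backpropagation). The sketch below uses a toy diagonal "Hessian" with a known trace; the helper names are mine:

```python
import random

def hutchinson_trace(hv_product, n, num_samples=100, seed=0):
    """Estimate Tr(H) as the average of z^T H z over random Rademacher
    vectors z, using only matrix-vector products with H."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        hz = hv_product(z)  # H @ z, without ever forming H
        total += sum(zi * hzi for zi, hzi in zip(z, hz))
    return total / num_samples

# Toy 'Hessian': diagonal with entries 1, 2, 3, so the true trace is 6.
diag = [1.0, 2.0, 3.0]
hvp = lambda z: [d * zi for d, zi in zip(diag, z)]
est = hutchinson_trace(hvp, n=3)
```

For a diagonal matrix every sample gives z^T H z = sum(diag) exactly (since z_i^2 = 1), so this toy converges immediately; for a real Hessian the estimate converges as the number of samples grows.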
- Mixed Precision Activation
- Weight Bit Selection
- this problem can be efficiently addressed using a Pareto frontier approach. The main idea is to sort the candidate bit-precision settings in B based on the total second-order perturbation that each causes, according to the following metric
- $\Omega = \sum_{l} \Omega_l = \sum_{l} \overline{\mathrm{Tr}}(H_l) \cdot \left\| Q(W_l) - W_l \right\|_2^2$, where $\overline{\mathrm{Tr}}(H_l)$ is the average Hessian trace of layer $l$, $W_l$ are the layer's weights, and $Q(W_l)$ is their quantized value under the candidate bit setting
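The Pareto frontier selection can be sketched as follows; the candidate settings, sizes, and Ω values below are entirely hypothetical:

```python
# Hypothetical candidate mixed-precision settings for a 2-layer model:
# bits per layer, resulting model size (MB), and total second-order
# perturbation Omega = sum_l Tr_bar(H_l) * ||Q(W_l) - W_l||^2.
candidates = [
    {"bits": (8, 8), "size_mb": 25.0, "omega": 0.10},
    {"bits": (8, 4), "size_mb": 19.0, "omega": 0.35},
    {"bits": (4, 8), "size_mb": 19.0, "omega": 0.20},
    {"bits": (4, 4), "size_mb": 13.0, "omega": 0.90},
]

def pick_setting(candidates, size_budget_mb):
    """Among settings within the size budget, choose the one with the
    smallest total second-order perturbation: no manual selection needed."""
    feasible = [c for c in candidates if c["size_mb"] <= size_budget_mb]
    return min(feasible, key=lambda c: c["omega"])

best = pick_setting(candidates, size_budget_mb=20.0)
```

At equal model size, the setting that keeps the Hessian-sensitive layer at higher precision causes a smaller Ω and is therefore the Pareto-optimal choice.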