Translation of "Tricks in Deep Neural Networks"

http://blog.csdn.net/pandav5/article/details/51178032

http://blog.csdn.net/dp_bupt/article/details/49308641

Deep neural networks (DNNs), and convolutional neural networks (CNNs) in particular, allow computational models composed of multiple processing layers to learn representations of data at multiple levels of abstraction. These methods have dramatically improved the state of the art in visual object recognition, object detection, and text recognition, as well as in other domains such as drug discovery and biology.

In addition, many substantial papers on these topics have been published, and many high-quality open-source CNN packages have been widely circulated, along with plenty of CNN tutorials and software manuals. What has been missing, however, is a comprehensive article covering all the implementation details needed to build a deep convolutional neural network. This article therefore collects and explains those implementation details of DNNs.

(I) Introduction 
This article describes the implementation details of DNNs in eight aspects: (1) data augmentation; (2) pre-processing of images; (3) network initialization; (4) tips used during training; (5) selection of activation functions; (6) diverse regularization methods; (7) insights gained from figures; and (8) ensembles of multiple deep networks.

(1) Data augmentation 
Deep networks need large amounts of training data to reach satisfactory performance, so when the original data set is small it is best to enlarge it by some means; data augmentation has gradually become a standard part of training deep networks. 
Common augmentation methods include horizontal flipping, random crops, and color jittering, and these can also be combined, e.g., rotating and randomly scaling an image at the same time. In addition, you can raise the saturation and value (the S and V components of HSV color space) of all pixels to a power between 0.25 and 4, multiply them by a factor between 0.7 and 1.4, or add a value between -0.1 and 0.1. A similar additive shift can be applied to the hue (H component). 
When training AlexNet, Krizhevsky et al. proposed fancy PCA [1], which augments the data set by altering the intensities of the RGB channels. They noted that fancy PCA approximately captures an important property of natural images: object identity is invariant to changes in the intensity and color of the illumination. This scheme reduced the top-1 error rate on ImageNet 2012 by over 1%.
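As a concrete illustration, fancy PCA can be sketched in a few lines of NumPy. This is a minimal hypothetical implementation (the function name is mine, and for simplicity the PCA is computed per image rather than over the whole training set; the α standard deviation of 0.1 follows the description above):

```python
import numpy as np

def fancy_pca(image, alpha_std=0.1, rng=None):
    """Fancy PCA color augmentation: shift every pixel along the
    principal components of the RGB covariance."""
    rng = np.random.default_rng() if rng is None else rng
    pixels = image.reshape(-1, 3)                        # (H*W, 3) RGB values
    cov = np.cov(pixels - pixels.mean(0), rowvar=False)  # 3x3 RGB covariance
    eigvals, eigvecs = np.linalg.eigh(cov)               # lambda_i, p_i (columns)
    alphas = rng.normal(0.0, alpha_std, size=3)          # alpha_i ~ N(0, 0.1^2)
    shift = eigvecs @ (alphas * eigvals)                 # [p1,p2,p3][a1*l1,...]^T
    return image + shift                                 # same shift, all pixels

img = np.random.default_rng(1).random((8, 8, 3))
aug = fancy_pca(img, rng=np.random.default_rng(0))
```

Because the same random shift is added to every pixel of one image, the augmentation changes the illumination color and intensity rather than the texture.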

(2) Pre-processing 
Once a large data set is obtained, it is worth pre-processing it; this part introduces several pre-processing methods. 
The simplest is zero-centering followed by normalization. A related variant normalizes each dimension of the data separately. Such normalization only makes sense when different features have different scales; for images, where pixel values share the same scale, it is unnecessary. 
Another method is PCA whitening: first zero-center the data, then compute the covariance matrix to uncover the correlation structure, decorrelate the data by projecting it onto the eigenvectors, and finally whiten by normalizing each dimension by its eigenvalue. A drawback of whitening is that it can exaggerate noise, because it stretches all dimensions of the input to the same scale, including dimensions that are mostly noise. This can be mitigated with stronger smoothing (a larger constant added to the eigenvalues). 
Note that none of these pre-processing steps is strictly necessary for CNNs.

(3) Network initialization 
Ideally, with proper data normalization it is reasonable to assume that roughly half of the weights will be positive and half negative, so initializing all weights to zero might seem sensible. This turns out to be a mistake: if every neuron computes the same output, then back-propagation computes identical gradients for all of them, so all parameters receive identical updates and there is no source of asymmetry between neurons.

All-zero initialization is therefore not advisable. To keep the initial weights small while breaking the symmetry above, you can initialize them to small random values near zero. The idea is that the neurons start out random and distinct, so they compute different updates and integrate themselves as diverse parts of the full network. For example, the parameters can be drawn from 0.001 × N(0, 1) or from a uniform distribution.

A problem with the above initialization is that the variance of a randomly initialized neuron's output grows with its number of inputs. In practice, you can normalize the variance of each neuron's output to 1 by scaling its weight vector by the square root of its fan-in: 

>>> w = np.random.randn(n) / sqrt(n)

where randn samples a standard Gaussian and n is the number of inputs. This guarantees that all neurons in the network initially have roughly the same output distribution, and it also improves the rate of convergence. Note that when the units are ReLUs, the recommended initialization is slightly different [2]: 

>>> w = np.random.randn(n) * sqrt(2.0/n)

(4) Tips during training 
a) Filter and pooling sizes. The input image size is usually chosen to be a power of 2, such as 32, 64, 224, 384, or 512. Using small filters, small strides, and zero padding not only reduces the number of parameters but also improves the accuracy of the whole network.

b) Learning rate. As Ilya Sutskever suggested in a blog post [3], divide the gradients by the mini-batch size; then you need not always change the learning rate when you change the batch size. Using a validation set is an effective way to choose the learning rate. A typical initial value is 0.1; whenever performance on the validation set stops improving, divide the learning rate by 2 or 5, and repeat until convergence.
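The divide-on-plateau schedule above can be sketched as a tiny helper. The class name and the `patience` parameter are illustrative choices, not part of the original recipe:

```python
class ReduceOnPlateau:
    """Start from lr and divide it by `factor` whenever the validation
    metric has not improved for `patience` consecutive checks."""

    def __init__(self, lr=0.1, factor=2.0, patience=2):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("-inf")
        self.bad = 0

    def step(self, val_acc):
        if val_acc > self.best:          # progress: remember the best value
            self.best, self.bad = val_acc, 0
        else:                            # stalled on the validation set
            self.bad += 1
            if self.bad >= self.patience:
                self.lr /= self.factor   # divide LR by 2 (or 5)
                self.bad = 0
        return self.lr
```

Calling `step(val_acc)` after each validation pass returns the learning rate to use for the next round of training.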

c) Fine-tuning pre-trained models. Many of today's best-performing models have been released by well-known research groups (e.g., the Caffe Model Zoo and the VGG group). Because these pre-trained models generalize well, they can be applied directly to your own data set. To further improve their performance on your data, the simplest approach is to fine-tune them. As the table below shows, the two most important factors are the size of the new data set and its similarity to the original one; different situations call for different fine-tuning strategies. The easy case is when your data set is very similar to that of the pre-trained model: with little data, simply train a linear classifier on the features from the top layers; with a lot of data, fine-tune a few of the top layers with a small learning rate. If your data set is large but quite different from the original, fine-tune more layers, again with a small learning rate. The hard case is a data set that is both small and very different from the original: with so little data it is better to train only a linear classifier, but the top-layer features are too specific to the original data set, so it is more appropriate to train an SVM on features from an earlier layer of the network. 

                      | very similar data set             | very different data set
 very little data     | linear classifier on top features | SVM on features from an earlier layer
 quite a lot of data  | fine-tune a few top layers        | fine-tune a larger number of layers

(5) Selecting activation functions 
A crucial ingredient of a deep network is the activation function, which is what makes the network non-linear. This part introduces several common activation functions and gives some advice on choosing among them. 
a) Sigmoid: its mathematical form is σ(x) = 1/(1+e^{-x}). It takes a real number and squashes it into (0, 1): very large negative inputs map to values near 0 and very large positive inputs to values near 1. The sigmoid was widely used because it nicely mimics a neuron going from not firing at all to fully saturated firing. Nowadays it is rarely used, mainly because of two drawbacks: 
1) Sigmoids saturate and kill gradients. 
When a neuron's activation saturates near 0 or 1, the gradient in those regions is almost zero. In back-propagation this local gradient is multiplied into the gradient of the output, so a saturated neuron effectively kills the gradient and almost no signal flows through it. One must also take care when initializing the weights of sigmoid neurons: if the initial weights are too large, most neurons saturate quickly and can no longer learn. 
2) Sigmoid outputs are not zero-centered. This affects the dynamics of gradient descent: if the input to a neuron is always positive, then the gradients on its weights during back-propagation are either all positive or all negative, producing undesirable zig-zagging in the weight updates. However, once the gradients are summed over a batch of data, the final update can have mixed signs, which somewhat mitigates the issue.
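The vanishing local gradient is easy to verify numerically; a small sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # local gradient d(sigma)/dx used in back-propagation

# The gradient peaks at 0.25 for x = 0 and is effectively zero in the tails,
# so a saturated sigmoid neuron passes almost no signal back to its weights.
print(sigmoid_grad(0.0))    # 0.25
print(sigmoid_grad(10.0))   # ~4.5e-05
```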

b) tanh(x): its output range is (-1, 1). Like the sigmoid it saturates, but unlike the sigmoid its output is zero-centered, so in practice tanh is preferred over sigmoid. 
c) ReLU: the rectified linear unit has become very popular in the last few years. Its activation function is f(x) = max(0, x), i.e., it simply thresholds at zero.
ReLU has the following pros and cons: 
1. Unlike the expensive operations of sigmoid and tanh, ReLU only needs to threshold a matrix of activations at zero, and it does not suffer from saturation; 
2. ReLU greatly accelerates the convergence of stochastic gradient descent; 
3. Its drawback is that it is fragile and units can "die" during training. For example, a large gradient flowing through a ReLU can update its weights in such a way that the neuron never activates on any data point again; from then on the gradient through that unit is always zero, and the unit stops contributing for the rest of training. 
d) Leaky ReLU: an attempt to fix the dying-ReLU problem. Instead of being zero for x < 0, the function takes a small negative slope there: f(x) = ax for x < 0 and f(x) = x for x ≥ 0, where a is a small constant. Some researchers report success with this activation function, but the results are not always consistent. 
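For reference, ReLU and Leaky ReLU differ only in how negative inputs are handled; a minimal sketch with an illustrative slope a = 0.01:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)            # f(x) = max(0, x)

def leaky_relu(x, a=0.01):
    return np.where(x >= 0, x, a * x)    # small slope a for x < 0

x = np.array([-2.0, 0.0, 1.5])
print(relu(x))        # [0.  0.  1.5]
print(leaky_relu(x))  # [-0.02  0.    1.5 ]
```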
e) PReLU/RReLU: these differ from Leaky ReLU in that the constant a is no longer fixed: in PReLU it is learned, while in RReLU it is sampled at random from a given range during training and fixed at test time. 
He et al. credited PReLU as a key factor in surpassing human-level performance on the ImageNet classification task, and RReLU was proposed in a Kaggle competition, where its randomness was found to reduce overfitting.
In [] the authors compared the performance of these activation functions on several data sets; the results are summarized below: 
[Table: error rates of ReLU, Leaky ReLU, PReLU, and RReLU on CIFAR-10, CIFAR-100, and NDSB]
Note that the slope reported in the table is actually 1/a. We can see that ReLU is not always the best on these three data sets, and that Leaky ReLU with a larger a achieves better accuracy. PReLU tends to overfit on small data sets but still outperforms ReLU. RReLU [4] performs best on NDSB, which suggests that RReLU can indeed alleviate overfitting, since NDSB is smaller than the other two data sets. 
In short, all three ReLU variants outperform the original ReLU.

(6) Regularization 
The capacity of a neural network can be controlled, and overfitting avoided, in several ways: 
1) L2 regularization: penalize overfitting by adding the squared magnitude of every parameter directly to the objective, i.e., for each weight w add the term (1/2)λw², where the coefficient λ measures the regularization strength. The factor 1/2 is there so that the gradient of the term with respect to w is simply λw. L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse ones. 
2) L1 regularization: for each weight w, add the term λ|w| to the objective. L1 and L2 can also be combined, as λ₁|w| + λ₂w², which is called elastic net regularization. The interesting property of L1 is that it drives the weight vectors to become sparse during optimization: after L1 regularization, neurons end up using only their most important inputs and become insensitive to input noise. Weight vectors under L2, by contrast, are usually diffuse and small. In practice, if you do not care about explicit feature selection, L2 is a good choice. 
3) Max-norm constraints: enforce an absolute upper bound on the magnitude of each neuron's weight vector, using projected gradient descent to enforce the constraint. In practice this means performing the parameter update as usual and then clamping every neuron's weight vector so that ||w||_2 < c, with typical values of c around 3 or 4. An appealing property is that the network cannot "explode" even with a large learning rate, because the updates are always bounded. 
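The clamping step of the max-norm constraint can be sketched as follows, assuming (hypothetically) that each row of W holds the incoming weights of one neuron:

```python
import numpy as np

def max_norm(W, c=3.0):
    """After the usual gradient update, rescale any row whose L2 norm
    exceeds c back onto the ball of radius c; other rows are untouched."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale
```

This is applied after every parameter update, so each neuron's weight vector can never leave the radius-c ball no matter how large the learning rate is.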
4) Dropout [5]: an extremely effective and simple regularization method. During training, dropout can be interpreted as sampling a sub-network from the full network and training it on the input data; the sampled sub-networks are not independent, however, because they share parameters.


At test time dropout is no longer applied; instead the prediction can be interpreted as an average over the exponentially many sub-networks. In practice the dropout ratio is usually p = 0.5, though it can be tuned on validation data.
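A minimal sketch of (inverted) dropout: at training time each activation is dropped with probability p and the survivors are scaled by 1/(1-p), so that no rescaling is needed at test time. The function name and signature are illustrative:

```python
import numpy as np

def dropout(x, p=0.5, train=True, rng=None):
    if not train:
        return x                       # dropout is disabled at test time
    rng = np.random.default_rng() if rng is None else rng
    mask = (rng.random(x.shape) >= p) / (1.0 - p)  # keep w.p. 1-p, rescale
    return x * mask
```

Because of the 1/(1-p) scaling, the expected activation is the same in training and test mode.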

(7) Insights from figures 
With the methods above you can obtain a satisfactory network setup, and during training you can also plot some curves to monitor training in real time. 
1) As we know, training is very sensitive to the learning rate. As Fig. 1 shows, a very high learning rate produces a rather strange loss curve. With a very low learning rate the training loss decreases very slowly even after many epochs, while a very high learning rate makes the loss drop quickly at first but then fall into a local minimum, leading to unsatisfactory results. With a good learning rate the loss curve is smooth and the network eventually reaches its best performance. 
2) Now zoom in on the loss curve. An epoch is one pass over the training data, so each epoch contains many mini-batches; plot the classification loss of every mini-batch, as in Fig. 2. As with Fig. 1, if the curve looks too linear the learning rate is low; if it barely decreases, the learning rate may be too high. The "width" of the curve relates to the batch size: if it looks too wide, the variance between batches is too large and you should increase the batch size. 
3) We can also look at accuracy curves, as in Fig. 3, where the red line is the training accuracy and the green line the validation accuracy. Once validation accuracy converges, the gap between the two curves shows how effective the training is. A large gap, i.e., high training accuracy but low validation accuracy, means the model is overfitting the training set, and the regularization should be strengthened. If the gap is small but the accuracy is low, the model has weak learning capacity, and you should increase its capacity. 

(8) Network ensembles [6] 
In machine learning, training multiple learners and combining them is a state-of-the-art way to reach the best performance. This part introduces several ensembling approaches: 
1) Same model, different initializations. Use cross-validation to determine the best hyperparameters, then train several models with those hyperparameters but different random initializations. The drawback is that the only source of variety is the initialization. 
2) Top models discovered during cross-validation. Use cross-validation to determine the best hyperparameters, then pick the few best-performing models and combine them. This improves diversity but risks including suboptimal models. In practice it is easy to carry out, since no additional retraining is needed after cross-validation; you can also pick several models from the Caffe Model Zoo to combine. 
3) Different checkpoints of a single model. If training is too expensive and resources are limited, you can take checkpoints of a single model over the course of training and combine their predictions. This clearly lacks diversity, but it can still work reasonably well in practice, and it is very cheap.
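Whichever way the ensemble members are chosen, combining them often amounts to averaging their class-probability outputs (late fusion); a minimal sketch:

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average the (n_samples, n_classes) probability outputs of several
    models or checkpoints and return the arg-max class per sample."""
    avg = np.mean(prob_list, axis=0)
    return avg.argmax(axis=1)

# Two members disagree on a sample; the averaged probabilities decide.
p1 = np.array([[0.6, 0.4]])
p2 = np.array([[0.3, 0.7]])
print(ensemble_predict([p1, p2]))   # [1]  (average is [0.45, 0.55])
```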
4) Practical examples. If your vision task involves high-level image semantics, e.g., event recognition from still images, a better approach is to train multiple deep models on different data sources to extract different and complementary deep features. For instance, in the cultural event recognition challenge at ICCV 2015, we trained five different deep networks on ImageNet, the Place Database, and the data set supplied by the competition organizers, obtaining five complementary deep features that we treated as multi-view data. Combining early fusion and late fusion strategies [], we took second place in that challenge. In [7], stacking NNs into a deeper ensemble was proposed.

In real-world applications, data sets are often class-imbalanced: some classes have many training examples while others have very few. As discussed in [], training deep networks on such imbalanced data can severely hurt performance. The simplest remedy is to balance the training data by up-sampling and down-sampling [8]. Another is the special cropping procedure proposed in [7]: since the original cultural event images were highly imbalanced, we extracted crops only from the classes with few examples, which both adds diversity to the data and alleviates the imbalance. You can also adapt the fine-tuning strategy: split the data into two parts, one containing the classes with many examples and the other the classes with few, so that within each part the imbalance is much milder; then fine-tune first on the part with many examples and afterwards on the part with few.
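The up-sampling fix can be sketched as drawing extra indices for the minority classes with replacement until every class matches the largest one (a hypothetical helper, not the exact procedure of [8]):

```python
import numpy as np

def oversample_indices(labels, rng=None):
    """Return sample indices in which every class appears as often as
    the largest class, by re-drawing minority examples with replacement."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    idx = []
    for c in classes:
        members = np.flatnonzero(labels == c)
        extra = rng.choice(members, size=target - members.size, replace=True)
        idx.extend(members)
        idx.extend(extra)
    return np.array(idx)
```

Training then iterates over `labels[idx]` instead of the raw data, so every mini-batch sees the classes in balanced proportions.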

=============================================================================================

Here we will introduce these extensive implementation details, i.e., tricks or tips, for building and training your own deep networks.

 

The discussion is organized mainly in eight aspects: 1) data augmentation; 2) pre-processing on images; 3) initializations of networks; 4) some tips during training; 5) selections of activation functions; 6) diverse regularizations; 7) some insights found from figures; and finally 8) methods of ensembling multiple deep networks.

These eight aspects are introduced in turn below.

1. Data augmentation

1. The popular data augmentation methods include horizontally flipping, random crops and color jittering. Moreover, you could try combinations of multiple different processing, e.g., doing the rotation and random scaling at the same time. In addition, you can try to raise saturation and value (S and V components of the HSV color space) of all pixels to a power between 0.25 and 4 (same for all pixels within a patch), multiply these values by a factor between 0.7 and 1.4, and add to them a value between -0.1 and 0.1. Also, you could add a value between [-0.1, 0.1] to the hue (H component of HSV) of all pixels in the image/patch.

2. Krizhevsky et al. [1] proposed fancy PCA. First perform PCA on the set of RGB pixel values throughout your training images, then add the following quantity to each RGB image pixel I_{xy} = [I_{xy}^R, I_{xy}^G, I_{xy}^B]^T: [p_1, p_2, p_3][α_1λ_1, α_2λ_2, α_3λ_3]^T, where p_i and λ_i are the i-th eigenvector and eigenvalue of the 3×3 covariance matrix of RGB pixel values, respectively, and α_i is a random variable drawn from a Gaussian with mean zero and standard deviation 0.1.

 

 

2. Pre-processing

1. The first and simplest pre-processing approach is to zero-center the data, and then normalize them.

 

code:
>>> X -= np.mean(X, axis = 0) # zero-center
>>> X /= np.std(X, axis = 0) # normalize

 

 

2. A pre-processing approach similar to the first one is PCA whitening.

 

>>> X -= np.mean(X, axis = 0) # zero-center
>>> cov = np.dot(X.T, X) / X.shape[0] # compute the covariance matrix
>>> U,S,V = np.linalg.svd(cov) # compute the SVD factorization of the data covariance matrix
>>> Xrot = np.dot(X, U) # decorrelate the data
>>> Xwhite = Xrot / np.sqrt(S + 1e-5) # divide by the eigenvalues (which are square roots of the singular values)

 

 

Regarding the two methods above: these transformations are not used with convolutional neural networks. However, it is still very important to zero-center the data, and it is common to see per-pixel normalization as well.

 

3. Initialization

 

1. All-zero initialization: if all weights are set to 0 (or any identical value), every neuron computes the same gradient and receives the same parameter update, i.e., there is no asymmetry between neurons.

In the ideal situation, with proper data normalization it is reasonable to assume that approximately half of the weights will be positive and half of them will be negative. A reasonable-sounding idea then might be to set all the initial weights to zero, which you expect to be the “best guess” in expectation. But, this turns out to be a mistake, because if every neuron in the network computes the same output, then they will also all compute the same gradients during back-propagation and undergo the exact same parameter updates. In other words, there is no source of asymmetry between neurons if their weights are initialized to be the same.

2. Initialization with small random numbers

(Rationale: the parameters should still be close to zero and symmetrically distributed, e.g., weights ~ 0.001 × N(0, 1).)
Thus, you still want the weights to be very close to zero, but not identically zero. In this way, you can randomly initialize these neurons to small numbers which are very close to zero, and this is treated as symmetry breaking. The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network. The implementation for the weights might simply look like weights ~ 0.001 × N(0, 1), where N(0, 1) is a zero-mean, unit-standard-deviation Gaussian. It is also possible to use small numbers drawn from a uniform distribution, but this seems to have relatively little impact on the final performance in practice.
 
3. Calibrating the variances: normalize each neuron's output variance to 1 by dividing its weights by the square root of its fan-in.

One problem with the above suggestion is that the distribution of the outputs from a randomly initialized neuron has a variance that grows with the number of inputs. It turns out that you can normalize the variance of each neuron's output to 1 by scaling its weight vector by the square root of its fan-in (i.e., its number of inputs), as follows:

>>> w = np.random.randn(n) / sqrt(n) # calibrating the variances with 1/sqrt(n)
4. Current recommendation: the method of [4] sets the neuron variance to 2.0/n, where n is the number of inputs, so the weights are sampled from a standard Gaussian and multiplied by sqrt(2.0/n).

As aforementioned, the previous initialization by calibrating the variances of neurons is without considering ReLUs. A more recent paper on this topic by He et al[4] derives an initialization specifically for ReLUs, reaching the conclusion that the variance of neurons in the network should be 2.0/n as:

>>> w = np.random.randn(n) * sqrt(2.0/n) # current recommendation

  1. K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In ICCV, 2015.

4. Training tips
1. Filters and pooling size.
The size of input images is preferably a power of 2, such as 32 (e.g., CIFAR-10), 64, 224 (commonly used for ImageNet), 384 or 512, etc. Moreover, it is important to employ a small filter (e.g., 3×3) and small strides (e.g., 1) with zero-padding, which not only reduces the number of parameters, but improves the accuracy rates of the whole deep network. Meanwhile, the special case mentioned above, i.e., 3×3 filters with stride 1, preserves the spatial size of images/feature maps. For the pooling layers, the commonly used pooling size is 2×2.
 
 
2. Learning rate. Suggested practice: divide the gradients by the batch size; then the LR need not change when the mini-batch size stays the same. Start with LR = 0.1, use the validation set to choose the LR, and divide it by 2 or 5 whenever progress stalls.
In addition, as described in a blog by Ilya Sutskever [2], he recommended dividing the gradients by the mini-batch size. Thus, you should not always change the learning rate (LR) when you change the mini-batch size. For obtaining an appropriate LR, utilizing the validation set is an effective way. Usually, a typical value of LR at the beginning of training is 0.1. In practice, if you see that you have stopped making progress on the validation set, divide the LR by 2 (or by 5), and keep going, which might give you a surprise.
 
 
3. Fine-tuning pre-trained models. You can directly use published models such as those from the Caffe Model Zoo and the VGG group.
To apply these models to a new data set, fine-tuning is required, and two important factors need to be considered: the size of the new data set and its similarity to the original data.
 
For further improving the classification performance on your data set, a very simple yet effective approach is to fine-tune the pre-trained models on your own data. As shown in the following table, the two most important factors are the size of the new data set (small or big), and its similarity to the original data set. Different strategies of fine-tuning can be utilized in different situations. For instance, a good case is when your new data set is very similar to the data used for training the pre-trained models. In that case, if you have very little data, you can just train a linear classifier on the features extracted from the top layers of the pre-trained models. (Case 1: if the new data set is small and distributed like the pre-training data, only the final linear classifier needs retraining.)
 
If you have quite a lot of data at hand, please fine-tune a few top layers of the pre-trained models with a small learning rate.
(With lots of similar data, fine-tune the last few layers with a small LR.)
 
However, if your own data set is quite different from the data used in the pre-trained models but you have enough training images, a large number of layers should be fine-tuned on your data, also with a small learning rate, for improving performance.
(If your data differ from the pre-training data but are plentiful, fine-tune many layers with a small LR.)
 
However, if your data set not only contains little data, but is also very different from the data used in the pre-trained models, you will be in trouble. Since the data is limited, it seems better to only train a linear classifier. Since the data set is very different, it might not be best to train the classifier on top of the network, which contains more dataset-specific features. Instead, it might work better to train an SVM classifier on activations/features from somewhere earlier in the network.
(With little data that also differ from the source data, the situation is hard: train only a linear classifier, ideally an SVM on features from an earlier layer of the network.)
 
 
5. Selection of activation functions
One of the crucial factors in deep networks is the activation function, which brings non-linearity into the network. Here we will introduce the details and characteristics of some popular activation functions and give advice later in this section.

 

 

The figures below are taken from: http://cs231n.stanford.edu/index.html

Several activation functions:

Sigmoid:

The sigmoid non-linearity has the mathematical form σ(x) = 1/(1+e^{-x}). It takes a real-valued number and "squashes" it into the range between 0 and 1. In particular, large negative numbers become 0 and large positive numbers become 1. The sigmoid function has seen frequent use historically since it has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully-saturated firing at an assumed maximum frequency (1). At the maximum value of 1 it is saturated.

The sigmoid has fallen out of favor because of two drawbacks:

(1) Sigmoids saturate and kill gradients.

When the neuron's activation saturates at either tail of 0 or 1, the gradient in these regions is almost zero; looking at the curve, its slope is flat at both ends.

 

The key problem is that this (local) gradient will be multiplied by the gradient of this gate's output with respect to the whole objective. When the local gradient is too small, it will effectively "kill" the gradient, and almost no signal will flow through the neuron to its weights and, recursively, to its data. Moreover, one must pay extra caution when initializing the weights of sigmoid neurons to prevent saturation: if the initial weights are too large, most neurons will become saturated and the network will barely learn.

 

(2) Sigmoid outputs are not zero-centered.

This is undesirable, since neurons in later layers of a neural network would be receiving data that is not zero-centered. This has implications on the dynamics during gradient descent, because if the data coming into a neuron is always positive (e.g., x > 0 element-wise in f = w^T x + b), then the gradients on the weights w during back-propagation will become either all positive or all negative (depending on the gradient of the whole expression f).

This could introduce undesirable zig-zagging dynamics in the gradient updates for the weights. However, notice that once these gradients are added up across a batch of data, the final update for the weights can have mixed signs, somewhat mitigating this issue. Therefore, this is an inconvenience, but it has far less severe consequences than the saturation problem above.

 

Tanh(x)

The tanh non-linearity squashes a real-valued number to the range [-1, 1]. Like the sigmoid neuron, its activations saturate, but unlike the sigmoid neuron its output is zero-centered. Therefore, in practice the tanh non-linearity is always preferred to the sigmoid non-linearity.

 

Rectified Linear Unit

 
The Rectified Linear Unit (ReLU) has become very popular in the last few years. It computes the function f(x) = max(0, x), which is simply thresholded at zero.

 

 

There are several pros and cons to using the ReLUs:

  1. (Pros) Compared to sigmoid/tanh neurons that involve expensive operations (exponentials, etc.), the ReLU can be implemented by simply thresholding a matrix of activations at zero. Meanwhile, ReLUs do not suffer from saturation.

    (Simple to compute, non-exponential, and never saturates.)

  2. (Pros) It was found to greatly accelerate (e.g., a factor of 6 in [1]) the convergence of stochastic gradient descent compared to the sigmoid/tanh functions. It is argued that this is due to its linear, non-saturating form.

    (Shown to greatly accelerate the convergence of stochastic gradient descent, which is attributed to its linear, non-saturating form.)

  3. (Cons) Unfortunately, ReLU units can be fragile during training and can “die”. For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, the ReLU units can irreversibly die during training since they can get knocked off the data manifold. For example, you may find that as much as 40% of your network can be “dead” (i.e., neurons that never activate across the entire training dataset) if the learning rate is set too high. With a proper setting of the learning rate this is less frequently an issue.

(Drawback: ReLU units can die during training. For example, a large gradient flowing through a neuron can leave the unit stuck at 0; with a very high learning rate you may find as much as 40% of the network's neurons never activating.)

Leaky ReLUs are one attempt to fix the "dying ReLU" problem. Instead of the function being zero when x < 0, a leaky ReLU will instead have a small negative slope (of 0.01, or so). That is, the function computes f(x) = αx if x < 0 and f(x) = x if x ≥ 0, where α is a small constant. Some people report success with this form of activation function, but the results are not always consistent. (The x < 0 part is modified to a small fixed slope α.)

 

A series of further modifications of ReLU followed:

 

ReLU, Leaky ReLU, PReLU and RReLU. In these figures, for PReLU, α_i is learned, while for Leaky ReLU α_i is fixed. For RReLU, α_ji is a random variable sampled from a given range during training and kept fixed during testing.

(The slope α of PReLU is learned; that of RReLU is randomly sampled during training and fixed at test time.)

 

 

  1. B. Xu, N. Wang, T. Chen, and M. Li. Empirical Evaluation of Rectified Activations in Convolution Network. In ICML Deep Learning Workshop, 2015.

The paper reports the performance of each activation function:

 

From these tables, we can find that the performance of ReLU is not the best on all three data sets. For Leaky ReLU, a larger slope α will achieve better accuracy rates. PReLU is easy to overfit on small data sets (its training error is the smallest, while its testing error is not satisfactory), but it still outperforms ReLU. In addition, RReLU is significantly better than the other activation functions on NDSB, which shows that RReLU can overcome overfitting, because this data set has less training data than CIFAR-10/CIFAR-100. In conclusion, the three ReLU variants all consistently outperform the original ReLU on these three data sets, and PReLU and RReLU seem to be better choices. Moreover, He et al. also reported similar conclusions in [4].

[4] K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In ICCV, 2015.

 

6. Regularizations

 

There are several ways of controlling the capacity of Neural Networks to prevent overfitting:

  • L2 regularization is perhaps the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. That is, for every weight w in the network, we add the term (1/2)λw² to the objective, where λ is the regularization strength. It is common to see the factor of 1/2 in front because then the gradient of this term with respect to the parameter w is simply λw instead of 2λw. The L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors.

  • L1 regularization is another relatively common form of regularization, where for each weight w we add the term λ|w| to the objective. It is possible to combine the L1 regularization with the L2 regularization: λ₁|w| + λ₂w² (this is called Elastic net regularization). The L1 regularization has the intriguing property that it leads the weight vectors to become sparse during optimization (i.e. very close to exactly zero). In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the "noisy" inputs. In comparison, final weight vectors from L2 regularization are usually diffuse, small numbers. In practice, if you are not concerned with explicit feature selection, L2 regularization can be expected to give superior performance over L1.

  • Max norm constraints. Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint. In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector w of every neuron to satisfy ||w||_2 < c. Typical values of c are on the order of 3 or 4. Some people report improvements when using this form of regularization. One of its appealing properties is that the network cannot "explode" even when the learning rates are set too high because the updates are always bounded.

  • Dropout is an extremely effective, simple and recently introduced regularization technique by Srivastava et al. in [6] that complements the other methods (L1, L2, maxnorm). During training, dropout can be interpreted as sampling a Neural Network within the full Neural Network, and only updating the parameters of the sampled network based on the input data. (However, the exponential number of possible sampled networks are not independent because they share the parameters.) During testing there is no dropout applied, with the interpretation of evaluating an averaged prediction across the exponentially-sized ensemble of all sub-networks (more about ensembles in the next section). In practice, the value of dropout ratio p=0.5 is a reasonable default, but this can be tuned on validation data.

N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 15(Jun):1929−1958, 2014.

 

7. Some insights found from figures

1. As we have known, the learning rate is very sensitive. From Fig. 1 below, a very high learning rate will cause a quite strange loss curve. A low learning rate will make your training loss decrease very slowly even after a large number of epochs. In contrast, a high learning rate will make the training loss decrease fast at the beginning, but it will also drop into a local minimum. Thus, your network might not achieve satisfactory results in that case. For a good learning rate, as the red line shown in Fig. 1, its loss curve is smooth and finally it achieves the best performance.

 

(Different LR settings give different loss curves; choose the LR accordingly.)

2.

  • Now let’s zoom in on the loss curve. An epoch is one pass over the training data, so there are multiple mini-batches in each epoch. If we draw the classification loss for every training batch, the curve looks like Fig. 2. Similar to Fig. 1, if the trend of the loss curve looks too linear, that indicates your learning rate is low; if it does not decrease much, it tells you that the learning rate might be too high. Moreover, the “width” of the curve is related to the batch size. If the “width” looks too wide, that is to say the variance between batches is too large, which points out that you should increase the batch size.

    (Each epoch contains many mini-batches. In the figure the vertical axis is the loss and the horizontal axis is the epoch; the vertical blue band at each epoch is the per-batch loss within that epoch. If the curve looks too linear, the LR is too low; if it barely decreases, the LR is too high. The width of the blue band relates to the batch size: if it looks too wide, increase the batch size to reduce the between-batch variance.)

 

 

3.

  • Another tip comes from the accuracy curve. As shown in Fig. 3, the red line is the training accuracy, and the green line is the validation one. When the validation accuracy converges, the gap between the red line and the green one shows the effectiveness of your deep network. If the gap is big, it indicates that your network gets good accuracy on the training data while only achieving low accuracy on the validation set. Obviously, your deep model is overfitting the training set; thus, you should increase the regularization strength of the network. However, no gap together with a low accuracy level is not a good thing either, which shows your deep model has low learnability. In that case, it is better to increase the model capacity for better results. (If the gap between training and validation accuracy is too large, you have overfitting.)

 

8. Ensemble

 

In machine learning, ensemble methods [8] that train multiple learners and then combine them for use are a kind of state-of-the-art learning approach. It is well known that an ensemble is usually significantly more accurate than a single learner, and ensemble methods have already achieved great success in many real-world tasks. In practical applications, especially challenges or competitions, almost all the first-place and second-place winners used ensemble methods.

Here we introduce several skills for ensemble in the deep learning scenario.

  • Same model, different initialization. Use cross-validation to determine the best hyperparameters, then train multiple models with the best set of hyperparameters but with different random initialization. The danger with this approach is that the variety is only due to initialization.

  • Top models discovered during cross-validation. Use cross-validation to determine the best hyperparameters, then pick the top few (e.g., 10) models to form the ensemble. This improves the variety of the ensemble but has the danger of including suboptimal models. In practice, this can be easier to perform since it does not require additional retraining of models after cross-validation. Actually, you could directly select several state-of-the-art deep models from Caffe Model Zoo to perform ensemble.

  • Different checkpoints of a single model. If training is very expensive, some people have had limited success in taking different checkpoints of a single network over time (for example after every epoch) and using those to form an ensemble. Clearly, this suffers from some lack of variety, but can still work reasonably well in practice. The advantage of this approach is that it is very cheap.

  • Some practical examples. If your vision tasks are related to high-level image semantics, e.g., event recognition from still images, a better ensemble method is to employ multiple deep models trained on different data sources to extract different and complementary deep representations. For example, in the Cultural Event Recognition challenge associated with ICCV’15, we utilized five different deep models trained on images from ImageNet, the Place Database, and the cultural images supplied by the competition organizers. After that, we extracted five complementary deep features and treated them as multi-view data. Combining “early fusion” and “late fusion” strategies described in [7], we achieved one of the best performances and ranked 2nd place in that challenge. Similar to our work, [9] presented the Stacked NN framework to fuse more deep networks at the same time.

9. Miscellaneous: class imbalance

In real world applications, the data is usually class-imbalanced: some classes have a large number of images/training instances, while some have very limited number of images.

(Class imbalance: some classes have abundant training data while others have very little.)

As discussed in a recent technical report [10], when deep CNNs are trained on these imbalanced training sets, the results show that imbalanced training data can have a severely negative impact on the overall performance of deep networks.

(Imbalanced training data hurts the whole network.)

For this issue, the simplest method is to balance the training data by directly up-sampling and down-sampling the imbalanced data, as shown in [10].

(One solution: directly up-sample and down-sample the data.)

Another interesting solution is one kind of special crops processing in our challenge solution [7]. Because the original cultural event images are imbalanced, we merely extract crops from the classes which have a small number of training images, which on one hand can supply diverse data sources, and on the other hand can solve the class-imbalanced problem. 

(Another solution: the cropping approach of [7].)

[7]X.-S. Wei, B.-B. Gao, and J. Wu. Deep Spatial Pyramid Ensemble for Cultural Event Recognition. In ICCV ChaLearn Looking at People Workshop, 2015.

In addition, you can adjust your fine-tuning strategy to overcome class imbalance. For example, you can divide your own data set into two parts: one containing the classes with a large number of training samples (images/crops), the other containing the classes with a limited number of samples. In each part, the class-imbalance problem will then be much less serious. When fine-tuning on your data set, first fine-tune on the classes with a large number of training samples, and then continue fine-tuning on the classes with a limited number of samples.

(A third method: adjust the fine-tuning strategy by splitting the data into a large-class part and a small-class part; fine-tune on the large classes first, then on the small ones.)

 

[10] P. Hensman and D. Masko. The Impact of Imbalanced Training Data for Convolutional Neural Networks. Degree Project in Computer Science, DD143X, 2015.

 

 

References & Source Links

    1. A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.

    2. A Brief Overview of Deep Learning, which is a guest post by Ilya Sutskever.

    3. CS231n: Convolutional Neural Networks for Visual Recognition of Stanford University, held by Prof. Fei-Fei Li and Andrej Karpathy.

    4. K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In ICCV, 2015.

    5. B. Xu, N. Wang, T. Chen, and M. Li. Empirical Evaluation of Rectified Activations in Convolution Network. In ICML Deep Learning Workshop, 2015.

    6. N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 15(Jun):1929−1958, 2014.

    7. X.-S. Wei, B.-B. Gao, and J. Wu. Deep Spatial Pyramid Ensemble for Cultural Event Recognition. In ICCV ChaLearn Looking at People Workshop, 2015.

    8. Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms. Boca Raton, FL: Chapman & Hall/CRC, 2012. (ISBN 978-1-439-830031)

    9. M. Mohammadi, and S. Das. S-NN: Stacked Neural Networks. Project in Stanford CS231n Winter Quarter, 2015.

    10. P. Hensman and D. Masko. The Impact of Imbalanced Training Data for Convolutional Neural Networks. Degree Project in Computer Science, DD143X, 2015.

posted on 2017-12-13 19:12 by 塔上的樹