Paper Reading | Improving neural networks by preventing co-adaptation of feature detectors

Improving neural networks by preventing co-adaptation of feature detectors

The development of dropout

In 2012, Hinton and his co-authors proposed dropout in the paper "Improving neural networks by preventing co-adaptation of feature detectors". When a complex feedforward neural network is trained on a small dataset, it easily overfits. To prevent this overfitting, the performance of the network can be improved by preventing feature detectors from co-adapting.

Also in 2012, Alex Krizhevsky and Hinton used dropout in the paper "ImageNet Classification with Deep Convolutional Neural Networks" to prevent overfitting. The AlexNet model introduced in that paper set off a wave of neural-network applications, won the 2012 ImageNet image-recognition competition, and established CNNs as the core model for image classification.

A number of dropout papers followed, including "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", "Improving Neural Networks with Dropout", and "Dropout as data augmentation".

Source: 深度学习中Dropout原理解析 (CSDN blog by weixin_39909212)

Selected passages from the paper

When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate. Random "dropout" gives big improvements on many benchmark tasks and sets new records for speech and object recognition.

A feedforward, artificial neural network uses layers of non-linear "hidden" units between its inputs and its outputs. By adapting the weights on the incoming connections of these hidden units it learns feature detectors that enable it to predict the correct output when given an input vector (1). If the relationship between the input and the correct output is complicated and the network has enough hidden units to model it accurately, there will typically be many different settings of the weights that can model the training set almost perfectly, especially if there is only a limited amount of labeled training data. Each of these weight vectors will make different predictions on held-out test data and almost all of them will do worse on the test data than on the training data because the feature detectors have been tuned to work well together on the training data but not on the test data.

Overfitting can be reduced by using "dropout" to prevent complex co-adaptations on the training data. On each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5, so a hidden unit cannot rely on other hidden units being present. Another way to view the dropout procedure is as a very efficient way of performing model averaging with neural networks. A good way to reduce the error on the test set is to average the predictions produced by a very large number of different networks. The standard way to do this is to train many separate networks and then to apply each of these networks to the test data, but this is computationally expensive during both training and testing. Random dropout makes it possible to train a huge number of different networks in a reasonable time. There is almost certainly a different network for each presentation of each training case but all of these networks share the same weights for the hidden units that are present.
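
As a minimal sketch of the procedure just described (NumPy, with arbitrary layer shapes and a ReLU nonlinearity chosen only for brevity; none of these choices are taken from the paper), the training-time forward pass simply samples a fresh binary mask for the hidden layer on every presentation of a training case:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily, only for reproducibility

def forward_train(x, W1, b1, W2, b2, p_drop=0.5):
    """One forward pass with dropout on the hidden layer (training time).

    Each hidden unit is omitted with probability p_drop, so the prediction
    for this training case comes from a randomly thinned sub-network that
    shares its weights with all the other sub-networks.
    """
    h = np.maximum(0.0, x @ W1 + b1)       # hidden activations (ReLU only for brevity)
    mask = rng.random(h.shape) >= p_drop   # keep each hidden unit with probability 1 - p_drop
    h = h * mask                           # omitted units contribute nothing to this case
    return h @ W2 + b2                     # output of the randomly thinned network
```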

We use the standard, stochastic gradient descent procedure for training the dropout neural networks on mini-batches of training cases, but we modify the penalty term that is normally used to prevent the weights from growing too large. Instead of penalizing the squared length (L2 norm) of the whole weight vector, we set an upper bound on the L2 norm of the incoming weight vector for each individual hidden unit. If a weight-update violates this constraint, we renormalize the weights of the hidden unit by division. Using a constraint rather than a penalty prevents weights from growing very large no matter how large the proposed weight-update is. This makes it possible to start with a very large learning rate which decays during learning, thus allowing a far more thorough search of the weight-space than methods that start with small weights and use a small learning rate.
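
A sketch of this max-norm constraint, assuming the same NumPy setting as above; the bound of 3.0 is an arbitrary illustrative value, not a number from the paper:

```python
import numpy as np

def apply_max_norm(W, max_norm=3.0):
    """Upper-bound the L2 norm of each hidden unit's incoming weight vector.

    W has shape (n_inputs, n_hidden), so column j holds the incoming weights
    of hidden unit j.  Any column whose L2 norm exceeds max_norm is
    renormalized by division so that its norm equals max_norm.
    """
    norms = np.linalg.norm(W, axis=0, keepdims=True)     # one L2 norm per hidden unit
    scale = np.minimum(1.0, max_norm / (norms + 1e-12))  # shrink only the violating columns
    return W * scale
```

It would be applied right after each update, for example `W1 = apply_max_norm(W1 - lr * grad_W1)`. Unlike an L2 penalty, the constraint puts no pressure on weights that already satisfy the bound, which is why training can start with a very large learning rate and decay it during learning.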

At test time, we use the "mean network" that contains all of the hidden units but with their outgoing weights halved to compensate for the fact that twice as many of them are active. In practice, this gives very similar performance to averaging over a large number of dropout networks. In networks with a single hidden layer of N units and a "softmax" output layer for computing the probabilities of the class labels, using the mean network is exactly equivalent to taking the geometric mean of the probability distributions over labels predicted by all 2^N possible networks. Assuming the dropout networks do not all make identical predictions, the prediction of the mean network is guaranteed to assign a higher log probability to the correct answer than the mean of the log probabilities assigned by the individual dropout networks (2). Similarly, for regression with linear output units, the squared error of the mean network is always better than the average of the squared errors of the dropout networks.
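
Continuing the same toy NumPy sketch, the test-time "mean network" keeps every hidden unit and simply scales the outgoing weights:

```python
import numpy as np

def forward_test(x, W1, b1, W2, b2, p_drop=0.5):
    """Forward pass of the 'mean network' (test time).

    Every hidden unit is present, and its outgoing weights are scaled by
    (1 - p_drop), i.e. halved for p_drop = 0.5, to compensate for twice as
    many units being active as during dropout training.
    """
    h = np.maximum(0.0, x @ W1 + b1)        # all hidden units are active
    return h @ (W2 * (1.0 - p_drop)) + b2   # outgoing weights halved when p_drop = 0.5
```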

We initially explored the effectiveness of dropout using MNIST, a widely used benchmark for machine learning algorithms. It contains 60,000 28x28 training images of individual hand-written digits and 10,000 test images. Performance on the test set can be greatly improved by enhancing the training data with transformed images (3) or by wiring knowledge about spatial transformations into a convolutional neural network (4) or by using generative pre-training to extract useful features from the training images without using the labels (5). Without using any of these tricks, the best published result for a standard feedforward neural network is 160 errors on the test set. This can be reduced to about 130 errors by using 50% dropout with separate L2 constraints on the incoming weights of each hidden unit and further reduced to about 110 errors by also dropping out a random 20% of the pixels (see figure 1).
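
For reference, that configuration (20% dropout on the input pixels, 50% on the hidden units) looks roughly as follows in a modern framework. This is only a sketch in PyTorch, which the 2012 paper of course predates; the two 800-unit ReLU hidden layers are placeholders rather than the exact architecture reported in the paper, but the dropout rates match the text above.

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),        # 28x28 image -> 784-dimensional vector
    nn.Dropout(p=0.2),   # drop a random 20% of the input pixels
    nn.Linear(784, 800),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # 50% dropout on the first hidden layer
    nn.Linear(800, 800),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # 50% dropout on the second hidden layer
    nn.Linear(800, 10),  # ten digit classes
)
# model.train() samples a fresh dropout mask on every forward pass;
# model.eval() disables dropout.  PyTorch uses "inverted" dropout, scaling
# the kept activations by 1/(1 - p) during training, so no test-time weight
# halving is needed to obtain the mean network.
```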

Motivation

Intuitively, dropout looks like an approximation to an ensemble in terms of classification performance; in practice, however, dropout is still carried out within a single neural network, and only one set of model parameters is trained. So why is it effective? This has to be analyzed from the motivation, and in the paper the authors give a very nice analogy for it:

In nature, medium and large animals generally reproduce sexually, meaning that offspring inherit half of their genes from each parent. Intuitively, asexual reproduction might seem more reasonable, since it can pass on long stretches of good genes intact, whereas sexual reproduction keeps splitting genes apart at random and breaks up the joint adaptation of large blocks of genes.

Yet natural selection did not choose asexual reproduction for these animals; it chose sexual reproduction, and the fittest survive. Let us first make an assumption: the power of genes lies in their ability to mix, not in the power of any single gene. Both sexual and asexual reproduction must obey this assumption. To see why sexual reproduction is the stronger strategy, consider a small exercise in probability.

Suppose, for example, that an attack is to be carried out, and there are two ways to do it:

1. Gather 50 people and have them carry out one big operation with a tight, precise division of labor.
2. Split the 50 people into 10 groups of 5, have each group act on its own and try something wherever it can, and count the whole effort as successful if any one group succeeds.

Which plan has the higher probability of success? Clearly the latter, because a single large coordinated operation has been turned into guerrilla warfare (a rough calculation is sketched below).
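
A back-of-the-envelope version of that comparison, under the purely illustrative assumption that each person independently succeeds at their own part with probability 0.8 (a number not taken from the original text):

```python
p = 0.8  # assumed per-person success probability (illustrative only)

# Plan 1: one team of 50 that succeeds only if every member succeeds.
p_single = p ** 50

# Plan 2: ten independent teams of 5; the effort succeeds if at least one team does.
p_team = p ** 5                    # probability that one 5-person team succeeds
p_split = 1 - (1 - p_team) ** 10   # probability that at least one of the ten succeeds

print(f"one team of 50:  {p_single:.6f}")  # about 1e-5
print(f"ten teams of 5:  {p_split:.3f}")   # about 0.98
```

The small independent units lose the benefit of tight coordination, but they are far more robust, which is exactly the property the gene analogy is pointing at.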

Carrying the analogy over: sexual reproduction not only passes good genes on, it also reduces the joint adaptation between genes, turning the joint adaptation of large, complex blocks of genes into the joint adaptation of many smaller pieces.

Dropout achieves the same effect: it forces each neuron to work together with randomly chosen other neurons and still produce good results. This weakens the joint adaptation between neurons and strengthens generalization.

A personal addition: plants and microorganisms mostly reproduce asexually because their environments change very little, so they do not need a strong ability to adapt to new environments, and keeping long stretches of good genes suited to the current environment is enough. Higher animals are different; they must be ready to adapt to new environments at any time, so breaking the joint adaptation of genes into many small pieces improves their chances of survival.
Reference: 理解dropout (CSDN blog by 雨石)

posted @ 2021-06-09 09:58 snackj