The concept of dropout and the scaling factor used during training and evaluation can be a bit confusing, so let's clarify the process and its implications.

1. Dropout During Training

  1. Random Neuron Dropping: During each forward pass in the training phase, each neuron is dropped (set to zero) with a probability \(p\).

  2. Scaling Neuron Outputs: To maintain the expected value of the activations, the remaining neurons that are not dropped are scaled by a factor of \(\frac{1}{1-p}\). This scaling ensures that the sum of the activations remains approximately the same on average, despite some neurons being dropped (see the code sketch just after this list).

    Mathematically, if \(\mathbf{h}\) is the activation vector, and \(\mathbf{m}\) is a mask vector with elements drawn from a Bernoulli distribution with parameter \(1-p\) (i.e., \(m_i \sim \text{Bernoulli}(1-p)\)), the dropout output \(\mathbf{\tilde{h}}\) is given by:

    \[\mathbf{\tilde{h}} = \frac{\mathbf{h} \odot \mathbf{m}}{1-p} \]

    where \(\odot\) denotes element-wise multiplication.
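As a concrete illustration, here is a minimal sketch of this training-time operation (a hypothetical NumPy helper, not any particular framework's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h: np.ndarray, p: float) -> np.ndarray:
    """Training-time (inverted) dropout: zero each unit with probability p,
    then scale the survivors by 1/(1-p)."""
    m = rng.binomial(1, 1.0 - p, size=h.shape)  # m_i ~ Bernoulli(1-p)
    return h * m / (1.0 - p)                    # (h ⊙ m) / (1 - p)

print(dropout_train(np.ones(6), p=0.5))  # roughly half the entries are 0, the rest are 2.0
```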

2. Dropout During Evaluation

  1. All Neurons Active: During evaluation (or testing), dropout is not applied, meaning no neurons are dropped. All neurons are active, and the network operates as a fully connected network without any random dropping.

  2. No Scaling Needed: Because all neurons are active, the scaling factor \(\frac{1}{1-p}\) applied during training is no longer needed. The activations are not altered during evaluation; they are used as-is.

Why the Module Computes an Identity Function During Evaluation

During evaluation, the network behaves as if dropout was never applied:

  • Identity Function: The "identity function" here means that the output of the layer during evaluation is simply the raw activations computed from the inputs, without any additional modifications (like dropout masking or scaling). Essentially, each layer computes its outputs as it normally would without dropout (see the short check after this list).

  • Consistency with Training: No additional scaling is needed during evaluation because the scaling was already incorporated during training to account for the dropped neurons. During training, each active neuron's output was scaled up to ensure the total output remained consistent despite dropout. Therefore, during evaluation, when no neurons are dropped, the activations naturally align with what the model expects, as if the dropout and scaling had never happened.
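As a quick sanity check, here is how this plays out with PyTorch's `nn.Dropout` (a small sketch; any dropout implementation that uses training-time scaling behaves the same way):

```python
import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.randn(6)

drop.train()                    # training mode: random zeroing plus scaling by 1/(1-p)
print(drop(x))                  # some entries are zero, the surviving ones are 2 * x_i

drop.eval()                     # evaluation mode: dropout is a no-op
print(torch.equal(drop(x), x))  # True -- the module computes the identity function
```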

Intuitive Explanation

Think of dropout as a way to train multiple smaller networks within the larger network. During training, each smaller network sees a slightly different subset of neurons, but on average, they all contribute to learning robust features. During evaluation, we want to use the full network's capacity without any randomness, and the network's weights have been adjusted (during training) to handle this full capacity appropriately.

In summary, during evaluation, the module simply computes the identity function because the dropout (and corresponding scaling) was a training-time mechanism to improve generalization, and the trained weights naturally accommodate the fully active network without requiring further adjustment.

The proportional amplification of each active neuron's output during the training process when using dropout is designed to ensure that the expected value of the activations remains consistent. This consistency helps the network maintain a stable learning process despite the randomness introduced by dropout.


2. Principle Behind Proportional Amplification

To understand why this amplification is necessary, let's dive into the principles of dropout and the statistical reasoning behind it.

  1. Random Dropping of Neurons: During training, each neuron is independently dropped out with a probability \(p\). This means each neuron has a \(1-p\) chance of remaining active.

  2. Expected Value of Neurons: If we did not scale the output of the active neurons, the total expected activation of a layer would be reduced by dropout. To see why, consider a layer with activations \(h_1, h_2, \ldots, h_n\). If each neuron is dropped with probability \(p\) (mask \(m_i \sim \text{Bernoulli}(1-p)\), as above), the expected post-dropout output of each neuron would be:

    \[\mathbb{E}[m_i h_i] = (1-p)\, h_i \]

  3. Scaling to Maintain Expected Value: To counteract this reduction, we scale up the output of each active neuron by \(\frac{1}{1-p}\). This ensures that the expected value of the activations remains the same as it would be without dropout. Mathematically, if \(h_i\) is active with probability \(1-p\) and is scaled by \(\frac{1}{1-p}\), the expected value of the scaled activation is:

    \[\mathbb{E}\left[\frac{m_i\, h_i}{1-p}\right] = (1-p) \cdot \frac{h_i}{1-p} + p \cdot 0 = h_i \]

    This keeps the overall expected activation level consistent (a quick numerical check follows this list).
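To make this concrete, here is a short Monte Carlo check of the two expectations above (a sketch assuming NumPy; the values \(h = 2.0\) and \(p = 0.3\) are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
h, p, n = 2.0, 0.3, 1_000_000

m = rng.binomial(1, 1 - p, size=n)   # one Bernoulli(1-p) mask per simulated forward pass
print(np.mean(h * m))                # ≈ (1 - p) * h = 1.4  (no scaling: expectation shrinks)
print(np.mean(h * m / (1 - p)))      # ≈ h = 2.0            (with scaling: expectation preserved)
```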

Detailed Explanation with Example

Consider a layer with three neurons producing activations \(h_1, h_2,\) and \(h_3\). Without dropout, the total activation output is:

\[h_1 + h_2 + h_3 \]

With dropout, each neuron is independently dropped with probability \(p\). Suppose \(h_2\) is dropped and \(h_1\) and \(h_3\) remain active. Without scaling, the total activation would be reduced to:

\[h_1 + 0 + h_3 = h_1 + h_3 \]

To ensure the total expected activation remains the same, we scale the output of each active neuron by \(\frac{1}{1-p}\). If \(h_1\) and \(h_3\) are active and scaled, their contributions to the total activation become:

\[\frac{h_1}{1-p} + \frac{h_3}{1-p} \]

A single pass like this one does not reproduce the dropout-free total exactly; the guarantee holds in expectation over the random masks. Because each mask element satisfies \(m_i \sim \text{Bernoulli}(1-p)\), the expected value of each scaled activation equals the original activation without dropout:

\[\mathbb{E}\left[\frac{m_i\, h_i}{1-p}\right] = h_i \]

Thus, averaged over many forward passes, the expected total activation with dropout and scaling matches the dropout-free total:

\[\mathbb{E}\left[\frac{m_1 h_1}{1-p} + \frac{m_2 h_2}{1-p} + \frac{m_3 h_3}{1-p}\right] = h_1 + h_2 + h_3 \]
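This can also be verified numerically for the three-neuron example (a sketch assuming PyTorch, whose `torch.nn.functional.dropout` applies this zero-and-scale step when `training=True`; the activation values are arbitrary):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
h = torch.tensor([1.0, 2.0, 3.0])                  # h1, h2, h3 (illustrative values)

batch = h.repeat(50_000, 1)                        # 50,000 independent forward passes
dropped = F.dropout(batch, p=0.5, training=True)   # zeroing + 1/(1-p) scaling built in

print(dropped.sum(dim=1)[:3])                      # individual pass totals fluctuate
print(dropped.sum(dim=1).mean())                   # ≈ 6.0, the dropout-free total h.sum()
```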

Conclusion

The scaling of active neurons' outputs by \(\frac{1}{1-p}\) during training ensures that the expected sum of activations remains the same as it would be without dropout. This scaling is crucial because it allows the network to maintain consistent activation levels and gradients during training, facilitating stable learning. By preserving the expected activation values, dropout prevents any shifts in the magnitude of the outputs, thereby ensuring that the network trains effectively even with the randomness introduced by dropout.


3. A Concrete Explanation

ChatGPT:

I apologize for any confusion. Let's clarify the concept further with a more detailed and precise explanation.

Without Dropout

When there is no dropout, each neuron's output directly contributes to the total activation output of the layer. For instance, if we have three neurons with activations \(h_1, h_2,\) and \(h_3\), the total activation output is:

\[h_1 + h_2 + h_3 \]

With Dropout (and Scaling)

When using dropout during training, each neuron is dropped (set to zero) with a probability \(p\). This means each neuron has a \(1-p\) chance of being active. To understand why scaling is needed, let's go through the steps with and without scaling.

Without Scaling

Suppose we apply dropout to the neurons, each with a probability \(p\). If \(h_2\) gets dropped (set to zero), the remaining active neurons \(h_1\) and \(h_3\) contribute to the total activation output:

\[h_1 + 0 + h_3 = h_1 + h_3 \]

However, this output comes from only two active neurons, and the total activation is lower than the expected total would be if all neurons were active. To correct this, we need to account for the fact that some neurons are randomly dropped.
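Concretely, writing \(m_1, m_2, m_3\) for the Bernoulli masks as before, linearity of expectation gives the unscaled expected total:

\[\mathbb{E}[m_1 h_1 + m_2 h_2 + m_3 h_3] = (1-p)(h_1 + h_2 + h_3) \]

so without scaling, the layer's expected output shrinks by exactly the factor \(1-p\).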

With Scaling

To maintain the same expected total activation output as without dropout, we scale up the outputs of the active neurons by \(\frac{1}{1-p}\). This scaling ensures that the expected value of the activations remains consistent. Here’s how it works:

  • Each neuron is active with a probability \(1-p\).
  • When a neuron is active, its output is scaled by \(\frac{1}{1-p}\).

So, for any individual neuron \(h_i\), the scaled output becomes:

\[\frac{h_i}{1-p} \]

When considering the expected value over many forward passes, each neuron’s contribution is scaled appropriately. Let’s calculate the expected value for a single neuron’s activation:

  • Probability that \(h_i\) is active: \(1-p\).
  • When \(h_i\) is active, its output is scaled to \(\frac{h_i}{1-p}\).

The expected value of the scaled activation for \(h_i\) is:

\[\mathbb{E}\left[\frac{m_i\, h_i}{1-p}\right] = (1-p) \cdot \frac{h_i}{1-p} + p \cdot 0 = h_i \]

Total Expected Activation with Dropout and Scaling

Applying this to all neurons, the expected total activation output with dropout and scaling becomes:

\[\mathbb{E}\left[\frac{m_1 h_1}{1-p}\right] + \mathbb{E}\left[\frac{m_2 h_2}{1-p}\right] + \mathbb{E}\left[\frac{m_3 h_3}{1-p}\right] = h_1 + h_2 + h_3 \]

This shows that the expected total activation output with dropout (and scaling) matches the total activation output without dropout.

Why Scaling Ensures Consistency

By scaling the outputs of the active neurons during training, dropout ensures that the expected sum of the activations remains the same as if no dropout were applied. This consistency is crucial for maintaining stable learning dynamics and gradients during training, which helps the network to effectively learn and generalize.

In summary, the key idea is that the scaling factor \(\frac{1}{1-p}\) corrects for the reduction in the number of active neurons due to dropout, ensuring that the expected activation output remains consistent. This way, even though individual forward passes during training have some neurons dropped, the overall expected behavior of the network remains the same as if dropout were not applied.
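Putting the two phases together, here is a minimal sketch of such a dropout module (assuming PyTorch; the class name `InvertedDropout` is illustrative, and `torch.nn.Dropout` already provides this behavior out of the box):

```python
import torch
from torch import nn

class InvertedDropout(nn.Module):
    """Mask-and-scale during training; exact identity during evaluation."""

    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.p == 0.0:
            return x                                        # eval mode: identity function
        keep = (torch.rand_like(x) > self.p).to(x.dtype)    # keep each unit with prob 1 - p
        return x * keep / (1.0 - self.p)                    # scale survivors by 1/(1-p)
```

Calling `model.train()` or `model.eval()` flips the `self.training` flag that `forward` checks, which is exactly why the module reduces to the identity function at evaluation time.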
