It turns out that vanishing gradients through the nonlinearity, caused by overly large input values, is a problem they solved ten years ago.
Paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
In practice, the saturation problem and the resulting vanishing gradients are usually addressed by using Rectified Linear Units (Nair & Hinton, 2010) ReLU(x) = max(x, 0), careful initialization (Bengio & Glorot, 2010; Saxe et al., 2013), and small learning rates. If, however, we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime, and the training would accelerate.
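To make the quoted idea concrete, here is a minimal NumPy sketch of the batch-normalization transform the paper proposes: normalize each pre-activation over the mini-batch, then apply a learnable scale gamma and shift beta. The function name and the toy data below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the mini-batch so the inputs to the
    # following nonlinearity have roughly zero mean and unit variance.
    mu = x.mean(axis=0)     # per-feature batch mean
    var = x.var(axis=0)     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Learnable scale and shift restore representational capacity.
    return gamma * x_hat + beta

# Hypothetical usage: pre-activations of shape (batch, features)
rng = np.random.default_rng(0)
z = rng.normal(loc=0.0, scale=50.0, size=(64, 8))  # deliberately large inputs
gamma = np.ones(8)
beta = np.zeros(8)
z_bn = batch_norm_forward(z, gamma, beta)
# The normalized values stay near zero, so a sigmoid or tanh applied to
# them operates in its non-saturated regime and gradients do not vanish,
# even though the raw pre-activations were large.
```

This matches the normalization step in the paper's Algorithm 1; a full implementation would also track running statistics for inference, which is omitted here for brevity.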
