tricks or tips

Sec. 1: Data Augmentation

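  • horizontal flipping, random crops, and color jittering
  • fancy PCA: shift each image's RGB values along the principal components of the RGB pixel distribution (the color augmentation used for AlexNet)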
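
A minimal NumPy sketch of the augmentations listed above, assuming images are H x W x 3 float arrays with values in [0, 1]. The function names, the jitter strength, and sigma are illustrative choices, not values from these notes. AlexNet computes the PCA over the RGB pixels of the whole training set; the per-image PCA here is a simplification for brevity.

```python
import numpy as np

def random_flip(img, rng):
    """Flip an H x W x C image left-right with probability 0.5."""
    return img[:, ::-1, :] if rng.random() < 0.5 else img

def random_crop(img, size, rng):
    """Cut a random size x size patch out of an H x W x C image."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)    # upper bound is exclusive
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size, :]

def color_jitter(img, rng, strength=0.1):
    """Scale each channel by a random factor in [1 - strength, 1 + strength)."""
    scale = 1.0 + rng.uniform(-strength, strength, size=(1, 1, img.shape[2]))
    return np.clip(img * scale, 0.0, 1.0)

def fancy_pca(img, rng, sigma=0.1):
    """Add random multiples of the RGB principal components to every pixel.

    Simplification (assumption): the PCA is computed on this image only.
    """
    pixels = img.reshape(-1, 3)
    cov = np.cov(pixels, rowvar=False)            # 3 x 3 RGB covariance
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvectors in columns
    alpha = rng.normal(0.0, sigma, size=3)        # one random draw per component
    shift = eigvecs @ (alpha * eigvals)           # same RGB shift for all pixels
    return np.clip(img + shift, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                   # stand-in for a real image
augmented = color_jitter(random_crop(random_flip(image, rng), 28, rng), rng)
```

Each call draws fresh random parameters, so the same training image yields a different variant every epoch.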

Sec. 2: Pre-Processing

  • zero-center and normalize
X -= np.mean(X, axis=0) # zero-center each feature (X must be a float array)
X /= np.std(X, axis=0)  # scale each feature to unit standard deviation
  • PCA Whitening
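X -= np.mean(X, axis=0) # zero-center
cov = np.dot(X.T, X) / X.shape[0] # covariance matrix of the data

U, S, V = np.linalg.svd(cov) # columns of U are eigenvectors, S holds the eigenvalues
Xrot = np.dot(X, U) # decorrelate: rotate the data into the eigenbasis

Xwhite = Xrot / np.sqrt(S + 1e-5) # whiten: divide by sqrt of eigenvalues (1e-5 avoids division by zero)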
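
To see what the whitening steps do, here is a small self-contained check on toy data. The toy data set (three correlated features built from independent Gaussians) is an assumption made only for this illustration. After the rotation the covariance should be diagonal, and after whitening it should be close to the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 3))
# Toy data (assumption): columns 0 and 1 are correlated, column 2 has a larger scale.
X = np.column_stack([Z[:, 0], Z[:, 0] + Z[:, 1], 2.0 * Z[:, 2]])

X -= np.mean(X, axis=0)                  # zero-center
cov = np.dot(X.T, X) / X.shape[0]        # covariance matrix
U, S, V = np.linalg.svd(cov)             # eigenvectors (columns of U), eigenvalues S
Xrot = np.dot(X, U)                      # decorrelated data
Xwhite = Xrot / np.sqrt(S + 1e-5)        # whitened data

cov_rot = np.dot(Xrot.T, Xrot) / Xrot.shape[0]        # diagonal, entries equal to S
cov_white = np.dot(Xwhite.T, Xwhite) / Xwhite.shape[0]  # approximately the identity
```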

Sec. 3: Initialization

  • all zero initialization

    Pitfall: every neuron computes the same output and therefore the same gradient, so the weights never become different from one another.

  • Initialization with Small Random Numbers

    weights ~ 0.001 * N(0,1)

  • Calibrating the Variances

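    The output of a randomly initialized neuron has a variance that grows with the number of inputs. Assuming the weights and inputs are independent, identically distributed, and zero-mean:

    \[\begin{align}
    Var(X) &= E(X^2)-[E(X)]^2\\
    Var(s) &= Var(\sum_{i=1}^n w_ix_i)\\
    &= \sum_{i=1}^n Var(w_ix_i)\\
    &= \sum_{i=1}^n\{E(w_i^2x_i^2)-[E(w_ix_i)]^2\}\\
    &= \sum_{i=1}^n\{E(w_i^2)E(x_i^2)-[E(w_ix_i)]^2\}\\
    &= \sum_{i=1}^n\{Var(w_i)Var(x_i)-2[E(w_ix_i)]^2+E(w_i^2)[E(x_i)]^2+[E(w_i)]^2E(x_i^2)\}\\
    &= \sum_{i=1}^n Var(w_i)Var(x_i)\\
    &= n\,Var(w)Var(x)
    \end{align} \]

    The extra terms vanish because E(w_i) = E(x_i) = 0. Scaling the weights by 1/sqrt(n) gives Var(w) = 1/n, so the output variance equals the input variance.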

w = np.random.randn(n) / np.sqrt(n) # n: the number of inputs
  • Current Recommendation

    An initialization derived specifically for ReLU neurons (He et al.):

w = np.random.randn(n) * np.sqrt(2.0 / n) # n: the number of inputs
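
The variance argument in this section can be checked empirically. The sketch below assumes unit-variance Gaussian inputs and weights; n and the number of trials are arbitrary illustrative values. For the ReLU case, the inputs are rectified Gaussians, so E(x^2) = 1/2 and the sqrt(2/n) scaling brings the output variance back to about 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200        # number of inputs to the neuron
trials = 5000  # independent draws of (w, x)

x = rng.standard_normal((trials, n))         # inputs with Var(x) = 1
w_naive = rng.standard_normal((trials, n))   # weights with Var(w) = 1
w_calibrated = w_naive / np.sqrt(n)          # weights with Var(w) = 1/n

# s = sum_i w_i x_i, one value per trial
var_naive = np.var(np.sum(w_naive * x, axis=1))            # roughly n
var_calibrated = np.var(np.sum(w_calibrated * x, axis=1))  # roughly 1

# ReLU case: inputs are outputs of a previous ReLU layer, weights use sqrt(2/n)
x_relu = np.maximum(0.0, x)                  # E(x_relu^2) = 1/2
w_relu = w_naive * np.sqrt(2.0 / n)          # Var(w) = 2/n
var_relu = np.var(np.sum(w_relu * x_relu, axis=1))         # roughly 1

print(var_naive, var_calibrated, var_relu)
```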

Sec. 4: During Training

  • Learning rate: when progress on the validation set stalls, divide the LR by 2 (or by 5) and keep training
  • Fine-tune pre-trained models on your own data:

                          | very similar dataset                   | very different dataset
    very little data      | Use linear classification on top layer | Try linear classification from different stages
    quite a lot of data   | Fine-tune a few layers                 | Fine-tune a large number of layers
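
A possible sketch of the learning-rate rule from the first bullet of this section, assuming the division is triggered when validation accuracy stops improving. The helper name `decay_on_plateau` and the `patience` parameter are hypothetical, introduced only for this illustration.

```python
def decay_on_plateau(lr, val_history, factor=2.0, patience=3):
    """Return lr / factor if validation accuracy has stalled, else lr.

    "Stalled" (assumption): the best accuracy in the last `patience`
    epochs is no better than the best accuracy seen before them.
    """
    if len(val_history) <= patience:
        return lr                                  # not enough history yet
    best_before = max(val_history[:-patience])
    best_recent = max(val_history[-patience:])
    return lr / factor if best_recent <= best_before else lr

# Accuracy has hovered around 0.72 for three epochs, so the LR is halved.
stalled_lr = decay_on_plateau(0.1, [0.60, 0.70, 0.72, 0.72, 0.71, 0.72])
# Accuracy is still climbing, so the LR stays the same.
improving_lr = decay_on_plateau(0.1, [0.50, 0.60, 0.70, 0.80])
```

Passing `factor=5.0` gives the more aggressive variant mentioned in the bullet.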

Sec. 5: Activation Functions

  • Sigmoid

    Cons: saturates and kills gradients; outputs are not zero-centered

  • tanh

    Cons: saturates and kills gradients

  • Rectified Linear Unit

    Pros: computationally cheap; non-saturating form
    Cons: dying ReLU (a unit can get stuck outputting zero and stop receiving gradient)

  • Leaky ReLU

  • Parametric ReLU

  • Randomized ReLU
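
For reference, the activation functions in this section can be written in a few lines of NumPy. This is a sketch under assumptions: the leaky slope of 0.01 and the randomized-ReLU bounds of 1/8 and 1/3 are common illustrative defaults, not values from these notes. Leaky, parametric, and randomized ReLU differ only in how the negative-side slope is chosen: fixed, learned, or sampled during training.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # output in (0, 1), not zero-centered

def tanh(x):
    return np.tanh(x)                        # output in (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)                # zero gradient for x < 0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)     # small fixed slope for x < 0

def parametric_relu(x, alpha):
    # Same form as leaky ReLU, but alpha is a parameter learned by backprop.
    return np.where(x > 0, x, alpha * x)

def randomized_relu(x, rng, lower=1.0 / 8, upper=1.0 / 3, training=True):
    # Training: the slope is sampled per element from U(lower, upper).
    # Testing: the slope is fixed to the mean of that range.
    if training:
        alpha = rng.uniform(lower, upper, size=x.shape)
    else:
        alpha = (lower + upper) / 2.0
    return np.where(x > 0, x, alpha * x)

x = np.array([-1.0, 2.0])
rng = np.random.default_rng(0)
train_out = randomized_relu(x, rng)                   # random slope on x = -1
test_out = randomized_relu(x, rng, training=False)    # deterministic slope
```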

Sec. 6: Regularization

  • L2 regularization: heavily penalizes peaky weight vectors and prefers diffuse weight vectors
  • L1 regularization: drives many weights to exactly zero, which amounts to explicit feature selection
  • Max norm constraints: enforce an absolute upper bound on the magnitude of each neuron's weight vector
  • Dropout: samples a sub-network within the full neural network and updates only the sampled network's parameters
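
The four regularizers above, sketched in NumPy. Several details are assumptions made for this illustration: each row of W is taken to be one neuron's incoming weights, the L2 penalty carries the conventional factor of 1/2, and dropout is written in its "inverted" form, which rescales at training time so that nothing changes at test time.

```python
import numpy as np

def l2_penalty(W, lam):
    """Added to the loss; its gradient is lam * W."""
    return 0.5 * lam * np.sum(W * W)

def l1_penalty(W, lam):
    """Added to the loss; pushes individual weights toward exactly zero."""
    return lam * np.sum(np.abs(W))

def max_norm(W, c):
    """Rescale each row of W so that its L2 norm is at most c."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W * np.minimum(1.0, c / np.maximum(norms, 1e-12))

def dropout(h, p_keep, rng, training=True):
    """Inverted dropout: keep each unit with probability p_keep."""
    if not training:
        return h                               # no scaling needed at test time
    mask = (rng.random(h.shape) < p_keep) / p_keep
    return h * mask

W = np.array([[3.0, 4.0], [0.3, 0.4]])         # row norms are 5.0 and 0.5
W_clipped = max_norm(W, 1.0)                   # first row is scaled down, second is kept

rng = np.random.default_rng(0)
h = np.ones(1000)
h_train = dropout(h, 0.5, rng)                 # entries are 0 or 2, mean stays near 1
h_test = dropout(h, 0.5, rng, training=False)  # unchanged
```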

Sec. 7: Insights from Figures

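  • Loss curve: a nearly linear decrease suggests the learning rate is too low; a loss that barely decreases suggests the learning rate is too high
  • Accuracy curve: a big gap between training and validation accuracy suggests overfitting, so increase regularization; little or no gap suggests underfitting, so increase model capacity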

Sec. 8: Ensemble

  • Same model, different initialization
  • Top models discovered during cross-validation
  • Different checkpoints of a single model
  • early fusion & late fusion
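
Whichever way the ensemble members are obtained, their predictions still have to be combined. The notes do not say how; one common choice, assumed here, is to average the predicted class probabilities. The function names and the toy logits are illustrative.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(logits_per_model):
    """Average class probabilities over models, then take the argmax."""
    probs = np.mean([softmax(l) for l in logits_per_model], axis=0)
    return probs, probs.argmax(axis=-1)

# Toy logits from three models for two samples and three classes.
logits_a = np.array([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
logits_b = np.array([[1.5, 0.2, 0.0], [0.0, 0.0, 2.0]])
logits_c = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.5]])

probs, labels = ensemble_predict([logits_a, logits_b, logits_c])
```

In the toy example a single model disagrees on each sample, and the averaged probabilities follow the two models that agree.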
posted @ 2018-04-10 16:02  Blueprintf