Neural Network Hyperparameters

Most machine learning algorithms involve “hyperparameters” which are variables set before actually optimizing the model's parameters. Setting the values of hyperparameters can be seen as model selection, i.e. choosing which model to use from the hypothesized set of possible models. Hyperparameters are often set by hand, selected by some search algorithm, or optimized by some “hyper-learner”.

Neural networks can have many hyperparameters, including those which specify the structure of the network itself and those which determine how the network is trained. This document describes the hyperparameters typically encountered when training neural networks and covers some common techniques for setting them, following the discussion in Section 3 of Bengio (2012) 1). In particular, we will focus on feed-forward neural nets trained with mini-batch gradient descent. When appropriate, recommendations for hyperparameter choices are given, which should be taken with many grains of salt.

Mini-Batch Gradient Descent Hyperparameters

When training a neural network, the resulting model will depend not only on the chosen structure but also on the training method used to set the network's parameters. The training method itself can have many hyperparameters. Here, we describe the hyperparameters of mini-batch gradient descent, which updates the network's parameters using gradient descent on a subset of the training data (which is periodically shuffled, or assumed infinite). We'll define the $t$-th (for $t < T$) mini-batch gradient descent update of the network parameters θ as

 

$$\theta^{(t)} \leftarrow \theta^{(t-1)} - \epsilon_t \frac{1}{B} \sum_{t' = Bt + 1}^{B(t+1)} \frac{\partial L(z_{t'}, \theta)}{\partial \theta}$$

 

where $z_{t'}$ is the $t'$-th example in the training set, and the hyperparameters are the loss function L, the learning rate $\epsilon_t$ at step t, the mini-batch size B, and the number of iterations T. Note that we must also define the initial parameter setting $\theta^{(0)}$, which is itself a hyperparameter.
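
To make the update rule concrete, here is a minimal NumPy sketch of the loop; the grad_loss function, the data array, and the default values are hypothetical placeholders standing in for a real network's gradient computation and training set.

```python
import numpy as np

def minibatch_sgd(theta, data, grad_loss, epsilon=0.01, B=32, T=1000, seed=0):
    """Run T mini-batch gradient descent updates of the parameters theta.

    grad_loss(batch, theta) is assumed to return the gradient of the loss,
    averaged over the mini-batch, with respect to theta.
    """
    rng = np.random.default_rng(seed)
    for t in range(T):
        # Draw a mini-batch of B examples (stands in for a shuffled pass).
        batch = data[rng.choice(len(data), size=B, replace=False)]
        # theta^(t) <- theta^(t-1) - epsilon_t * (1/B) * sum of per-example gradients
        theta = theta - epsilon * grad_loss(batch, theta)
    return theta
```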

Learning Rate

The learning rate ϵt determines how “quickly” the gradient updates follow the gradient direction. If the learning rate is very small, the model will converge too slowly; if the learning rate is too large, the model will diverge. When ϵt varies over iterations, we must define an initial learning rate $\epsilon_0$ and a way to compute ϵt. For neural nets with standardized inputs or inputs bounded on $[0, 1]$, $\epsilon_0$ is typically set between $10^{-6}$ and 1, and a good value to try is 0.01. However, $\epsilon_0$ is important enough that it should always be tuned.

The learning rate ϵt is typically decreased over time; the heuristic being that at first a decent parameter setting should be found, after which it should be fine-tuned. One approach is to set $\epsilon_t = \epsilon_0$ when $t < \tau$ (now τ is a new hyperparameter), after which point we set $\epsilon_t = \epsilon_0 / t^{\alpha}$ (α is another hyperparameter). “Smaller” values of α should be used when the objective is very non-convex and some kind of gradient averaging (e.g. momentum) is used. It's also common to set τ adaptively to the iteration at which the training criterion stops decreasing significantly.
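
A sketch of this schedule, with hypothetical default values for $\epsilon_0$, τ, and α:

```python
def learning_rate(t, epsilon_0=0.01, tau=100, alpha=1.0):
    """epsilon_t = epsilon_0 while t < tau, then epsilon_0 / t**alpha afterwards."""
    return epsilon_0 if t < tau else epsilon_0 / t ** alpha
```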

Loss Function

The loss function compares the network's output for a training example against the intended ground truth output. A common general-purpose loss function is the squared Euclidean distance, given by

 

$$L = \frac{1}{2}\sum_i (y_i - z_i)^2$$

 

where $y_i$ is the output of the $i$-th network output unit, and $z_i$ is the $i$-th value of the target output. The 1/2 is included by convention so that the gradient takes a simpler form, $(y_i - z_i)$. When the output of the neural network is treated as a probability distribution (e.g. a softmax output layer is used), it's common and more theoretically sound to use the cross-entropy loss, defined by

 

$$L = -\sum_i z_i \log(y_i)$$
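
Both losses are straightforward to compute. Below is a small NumPy sketch using the y (output) and z (target) conventions above; the small constant added inside the logarithm is a numerical-stability detail, not part of the definition.

```python
import numpy as np

def squared_error(y, z):
    """L = 1/2 * sum_i (y_i - z_i)^2; the 1/2 makes the gradient simply (y - z)."""
    return 0.5 * np.sum((y - z) ** 2)

def cross_entropy(y, z, eps=1e-12):
    """L = -sum_i z_i * log(y_i) for outputs y forming a distribution (e.g. softmax)."""
    return -np.sum(z * np.log(y + eps))  # eps guards against log(0)
```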

 

Mini-Batch Size

Theoretically, the choice of B is mostly computational: when B is larger, updates can be computed more efficiently by exploiting parallel architectures; when B is smaller, more updates can be made for the same number of examples seen. Because B should not affect generalization/test performance, it can be optimized separately from the other hyperparameters (e.g. it can be set first and then fixed while the other hyperparameters, aside from momentum, are optimized).

Number of Training Iterations

The most common way to set T is using the principle of early stopping. Early stopping simply halts training once performance on a held-out validation set stops improving (i.e. the validation cost begins to increase steadily instead of decrease). This can be a powerful way to prevent overfitting, to the extent that it may make the choice of other hyperparameters less important. The validation performance should be evaluated infrequently; for example, every time the algorithm has seen several times more new examples than there are in the validation set.
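
A minimal sketch of early stopping with a “patience” counter; step and validate are hypothetical callables standing in for one mini-batch update and one validation-set evaluation, and the default intervals are arbitrary.

```python
def train_with_early_stopping(step, validate, max_iters=100_000,
                              eval_every=1_000, patience=5):
    """Stop once the validation cost has failed to improve `patience`
    evaluations in a row; evaluate only every `eval_every` updates."""
    best_cost, strikes = float("inf"), 0
    for t in range(1, max_iters + 1):
        step()                           # one mini-batch gradient update
        if t % eval_every == 0:          # evaluate infrequently
            cost = validate()
            if cost < best_cost:
                best_cost, strikes = cost, 0
            else:
                strikes += 1
                if strikes >= patience:  # cost stopped decreasing
                    break
    return best_cost
```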

Momentum

A very common technique is to “smooth” the gradient updates using a leaky integrator filter with parameter β by

 

$$\bar{g} \leftarrow (1 - \beta)\,\bar{g} + \beta \frac{\partial L(z_t, \theta)}{\partial \theta}$$

 

$\bar{g}$ can then be used in place of the “true” gradient in the update rule above. Some mathematically motivated approaches can ensure much faster convergence when appropriate momentum is used; however, for pure stochastic gradient descent, standard gradient updates (β = 1) with a harmonically decreasing learning rate are optimal.
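
A minimal sketch of one smoothed update; the names and default values are illustrative only.

```python
def momentum_update(theta, g_bar, grad, epsilon=0.01, beta=0.1):
    """One parameter update using the leaky-integrator-smoothed gradient g_bar.

    beta = 1 recovers plain stochastic gradient descent.
    """
    g_bar = (1.0 - beta) * g_bar + beta * grad  # smooth the gradient estimate
    theta = theta - epsilon * g_bar             # use g_bar in place of grad
    return theta, g_bar
```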

Model Hyperparameters

The structure of the neural network itself involves numerous hyperparameters in its design, including the size and nonlinearity of each layer. The numeric properties of the weights are often also constrained in some way, and their initialization can have a strong effect on model performance. Finally, preprocessing of the input data can also be important for ensuring convergence. As a practical note, many hyperparameters can vary across layers.

Number of Hidden Units

Large hidden layers allow the neural network to fit the training data arbitrarily well, and because regularization is typically used, using more hidden units than necessary mostly costs computation rather than generalization performance; it is therefore usually safe to err on the side of large hidden layers. Using the same size for all hidden layers generally works as well as or better than using a decreasing or increasing size. In addition, using a first hidden layer that is larger than the input layer tends to work better. When using unsupervised pre-training, the layers should be made much bigger than when doing purely supervised optimization.

Weight Decay

To reduce overfitting, a regularization penalty on the network weights is sometimes added to the training criterion (loss function). When encouraging the network weights θ to be close to zero (the typical case), L2 regularization adds $\lambda_2 \sum_i \theta_i^2$ while L1 adds $\lambda_1 \sum_i |\theta_i|$, where $\lambda_2$ and $\lambda_1$ are Lagrange multipliers which determine how “important” this regularization should be considered to be.

This regularization can be viewed as a negative log-prior on the parameters. In this interpretation, in the mini-batch case, the gradient of the regularization penalty should be multiplied by B/N, where N is the training set size (written as N here to avoid confusion with the number of iterations T). In the online setting, B/t should be used. L2 regularization corresponds to a Gaussian prior with variance $\sigma^2 = 1/(2\lambda_2)$, penalizing large weight values. L1 regularization corresponds to a Laplace density prior with scale parameter $s = 1/\lambda_1$, acting as a form of feature selection. It is common to constrain only the input and/or output weights.
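
As an illustration, here is a sketch of the combined regularization gradient under this scaling; the coefficients and sizes are hypothetical defaults.

```python
import numpy as np

def regularization_gradient(theta, lambda2=1e-4, lambda1=0.0, B=32, N=50_000):
    """Gradient of the L2 penalty lambda2 * sum(theta**2) plus the L1 penalty
    lambda1 * sum(|theta|), scaled by B/N for the mini-batch setting."""
    grad = 2.0 * lambda2 * theta + lambda1 * np.sign(theta)
    return (B / N) * grad
```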

Activation Sparsity

It may be advantageous for the hidden unit activations to be sparse. One way to encourage this is to use an L1 penalty (as discussed above) on the hidden unit activations, provided that the activation function saturates at 0 (as the sigmoid does, and tanh does not). Other common sparsity-inducing regularizers include the Student-t penalty $\sum_j \log(1 + h_j^2)$, the mean-squared penalty $\sum_j (\rho - \bar{h}_j)^2$, and the KL-divergence penalty $\sum_j -\rho \log(\bar{h}_j) - (1 - \rho)\log(1 - \bar{h}_j) + C$, where $h_j$ is the $j$-th activation of hidden layer h, $\bar{h}_j$ is the average activation over, e.g., a mini-batch, and ρ and C are additional hyperparameters.
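
For instance, a sketch of the KL-divergence penalty (up to the constant C) computed over a mini-batch of sigmoid activations; the clipping constant is just for numerical safety.

```python
import numpy as np

def kl_sparsity_penalty(H, rho=0.05):
    """KL-divergence sparsity penalty (up to the constant C) on a mini-batch.

    H is a (batch_size, num_units) array of activations in (0, 1), e.g. sigmoid.
    """
    h_bar = np.clip(H.mean(axis=0), 1e-7, 1 - 1e-7)  # average activation per unit
    return np.sum(-rho * np.log(h_bar) - (1.0 - rho) * np.log(1.0 - h_bar))
```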

Nonlinearity

Commonly used nonlinearities include: the sigmoid $1/(1 + e^{-a})$, which smoothly “squashes” its input into the range $[0, 1]$; tanh, $(e^a - e^{-a})/(e^a + e^{-a})$, which is equivalent (up to scaling) to the sigmoid but with range $[-1, 1]$; the rectifier (ReLU) $\max(0, a)$; and the “hard tanh” or step function. Sigmoid activation on the output layer can cause issues with deep networks trained in a supervised fashion. Similarly, “hard” units (ReLU and step) don't make sense for the output layer, because when the output is saturated, no error gradient is passed back into the network.
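
For reference, the three most common of these nonlinearities in NumPy:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))  # smoothly squashes into [0, 1]

def tanh(a):
    return np.tanh(a)                # scaled sigmoid with range [-1, 1]

def relu(a):
    return np.maximum(0.0, a)        # rectifier: zero gradient when a < 0
```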

Weight Initialization

Biases are typically initialized to 0, but weights must be initialized carefully; their initialization can have a big impact on the local minimum found by the training algorithm. When initialized randomly, weights are often drawn uniformly from a range $[-r, r]$, where r is the inverse of the square root of the unit's fan-in (to which the fan-out is sometimes added). For RBMs, a zero-mean Gaussian with a small standard deviation (0.1 or 0.01) should be used. Unsupervised pre-training can be seen as essentially a sophisticated initialization technique, which is considered in most settings “to help and very rarely to hurt”. Common unsupervised initialization techniques include greedily stacked RBMs (forming a DBN) and denoising or contractive autoencoders.
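
A sketch of the fan-in-based random initialization described above; the variant dividing by the square root of the fan-in alone is shown, with the fan-in-plus-fan-out variant noted in a comment.

```python
import numpy as np

def init_layer(fan_in, fan_out, seed=0):
    """Uniform weights in [-r, r] with r = 1/sqrt(fan_in); biases start at zero."""
    rng = np.random.default_rng(seed)
    r = 1.0 / np.sqrt(fan_in)  # some variants use 1/sqrt(fan_in + fan_out)
    W = rng.uniform(-r, r, size=(fan_in, fan_out))
    b = np.zeros(fan_out)
    return W, b
```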

Random Seeds and Model Averaging

Many of the processes involved in training a neural network rely on a random number generator (e.g. random sampling of training data, weight initialization, etc.). As a result, the seed passed to the random number generator affects the resulting model: while models trained with different seeds usually perform about equally well, two seeds can produce non-trivially different models. It's therefore common to train multiple models with multiple random seeds and use model averaging (bagging, Bayesian methods) to improve performance.
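
A sketch of the simplest form of seed-based model averaging; train is a hypothetical function that fits a model for a given seed and returns an object with a predict method.

```python
import numpy as np

def averaged_predictions(train, X_test, seeds=(0, 1, 2, 3, 4)):
    """Train one model per random seed and average their predictions on X_test."""
    predictions = [train(seed).predict(X_test) for seed in seeds]
    return np.mean(predictions, axis=0)
```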

Preprocessing Input Data

The statistics of the input data can have a strong effect on network performance. Element-wise standardization (subtract the mean and divide by the standard deviation), Principal Component Analysis, uniformization (transform each feature value to its approximate normalized rank or quantile), and nonlinearities such as the logarithm or square root are common.
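
Two of these preprocessing steps, standardization and (approximate, rank-based) uniformization, as a NumPy sketch:

```python
import numpy as np

def standardize(X):
    """Subtract each feature's mean and divide by its standard deviation."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

def uniformize(X):
    """Map each feature value to its approximate normalized rank in (0, 1);
    ties are broken arbitrarily in this simple double-argsort version."""
    ranks = X.argsort(axis=0).argsort(axis=0)  # rank of each value, per column
    return (ranks + 1.0) / (X.shape[0] + 1.0)
```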

Hyperparameter Space Exploration

The number of hyperparameters delineated above indicates that there are a substantial number of choices to be made when creating a neural network learner, and that these choices will affect the success or failure of the model. In order to ensure reproducibility of results, a principled approach should be used for setting hyperparameters, or, at the very least, the chosen values should be explicitly stated in the model description. If a human is involved in the hyperparameter search and the resulting values are not stated explicitly, the results are not reproducible.

Hyperparameter selection can be seen as both an optimization problem (which is not necessarily convex in any single variable) and a generalization problem (because overfitting the validation set is still possible). It is made especially difficult by its computational cost: each hyperparameter setting requires training a new model, and model training is typically expensive. However, it has been shown that for some hyperparameters, a good setting can be obtained from a cheaper estimator (e.g. one with randomly set weights).

In most cases, a range of values is tried for each hyperparameter. It's always possible that the best value falls on the edge of this range, which may indicate that a better value lies outside of it. Due to non-convexity, however, a “best” value on the interior of the search interval still doesn't ensure that a better value doesn't fall outside the interval. The “scale” of the interval also needs to be chosen (i.e. how the values are sampled). It often makes most sense to sample the interval logarithmically, because the ratio between different values is often a better guide to the expected impact of a change than the absolute difference. Once a good solution is found, we can adjust the scale and range that we're searching over in order to fine-tune the best hyperparameter setting.
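
A log-domain sampler along these lines; the $[10^{-6}, 1]$ learning-rate range from earlier is a natural example input.

```python
import numpy as np

def sample_log_uniform(low, high, size=None, seed=None):
    """Sample uniformly in the log domain, e.g. learning rates in [1e-6, 1]."""
    rng = np.random.default_rng(seed)
    return np.exp(rng.uniform(np.log(low), np.log(high), size=size))
```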

Coordinate Descent

We can apply the idea of coordinate descent to hyperparameter optimization - keep all hyperparameters fixed except for one, and adjust that hyperparameter to minimize the validation error.
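
A sketch of this procedure; validation_error is a hypothetical function that trains a model with the given hyperparameter setting and returns its validation error.

```python
def coordinate_descent(validation_error, setting, candidates, sweeps=2):
    """Tune one hyperparameter at a time while holding the others fixed.

    setting is a dict of current values; candidates maps each hyperparameter
    name to the list of values to try for it.
    """
    for _ in range(sweeps):
        for name, values in candidates.items():
            # Keep whichever value of this hyperparameter minimizes validation error.
            setting[name] = min(
                values, key=lambda v: validation_error(**{**setting, name: v})
            )
    return setting
```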

Grid Search

Grid search simply tries every hyperparameter setting over a specified range of values. This involves a cross-product of all intervals, so the computational expense is exponential in the number of hyperparameters. Fortunately, it can be easily parallelized, but care should be taken to ensure that if one job fails, it fails gracefully; otherwise a portion of the hyperparameter space could be left unexplored. Typically, a user-driven refinement approach is also used, where a large-scale coarse search is first carried out, followed by successively more fine-grained searches.
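
A minimal grid search sketch, reusing the hypothetical validation_error function from the coordinate descent sketch above.

```python
import itertools

def grid_search(validation_error, grid):
    """Exhaustively evaluate every combination of values in `grid` (a dict
    mapping hyperparameter names to lists of values to try)."""
    names = list(grid)
    best_err, best_setting = float("inf"), None
    for values in itertools.product(*(grid[n] for n in names)):
        setting = dict(zip(names, values))
        err = validation_error(**setting)
        if err < best_err:
            best_err, best_setting = err, setting
    return best_setting, best_err
```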

Random Search

A straightforward alternative to grid search is to sample the hyperparameter space randomly. This is similarly trivially parallelizable and can work much better than grid search in practice, both because grid search can take an exponentially long time to reach a good hyperparameter subspace and because only a few hyperparameters tend to matter a lot. We can also readily introduce hyperparameter distributions (continuous variables are typically uniform in the log domain, inside the interval of interest; discrete parameters are typically multinomially distributed) and encode conditional dependence between hyperparameters. The search can be terminated once the validation error plateaus.
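
And a corresponding random search sketch, where sample_setting is a hypothetical function drawing one setting from the chosen hyperparameter distributions (e.g. using sample_log_uniform above for the learning rate).

```python
def random_search(validation_error, sample_setting, budget=50):
    """Evaluate `budget` randomly drawn settings and keep the best one."""
    best_err, best_setting = float("inf"), None
    for _ in range(budget):
        setting = sample_setting()  # draw from the chosen distributions
        err = validation_error(**setting)
        if err < best_err:
            best_err, best_setting = err, setting
    return best_setting, best_err
```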

Model-based Methods

Recently, various sequential model-based optimization methods have been proposed to search the hyperparameter space in a principled way. These techniques include modeling the generalization performance as a sample from a Gaussian process and as a graph-structured generative process using a tree-structured Parzen estimator. These approaches are implemented in the Hyperopt, Spearmint, and SMAC packages.

1) Yoshua Bengio, “Practical recommendations for gradient-based training of deep architectures”, arXiv:1206.5533; in Neural Networks: Tricks of the Trade, Second Edition, Lecture Notes in Computer Science Volume 7700, eds. Grégoire Montavon, Geneviève B. Orr, Klaus-Robert Müller, 2012.