Deep Learning Checklist
Recently I started the Deep Learning Specialization on Coursera. While I studied neural networks in my master's program (from Andrew Ng himself!), that was a long time ago and the field has changed considerably since then. I am supplementing the course by reading Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, which I will refer to as GBC16.
What really strikes me is how much the hyperparameters—the learning rate, parameter initialization, activation function, etc.—impact the performance of the network. My primary focus area in my master's program was Convex Optimization, and when an optimization problem is convex, there is pretty much one “best” learning rate[^1], and the initialization is not terribly important either. Sure, the choice of algorithm depends on things like the number of parameters and the size of the data, but it is typically pretty easy to select a method and hyperparameters that will work pretty well right off the bat.
I think a big reason for the difference is that Convex Optimization is now a fairly well-understood field. As Stephen Boyd said, it’s now a science more than an art. In contrast, as GBC16 reports, there is much we do not understand about optimization of neural networks, and so there is much art in tuning performance.
In Deep Learning, not only are the hyperparameters hugely impactful, there are so many of them! It was making me a little dizzy trying to keep them all straight. I therefore decided to write a little cheat sheet of things to start tweaking to improve performance. As I continue to learn, I will update this accordingly.
Things to Tune
- Learning rate
- Optimization algorithm
- Architecture
- Parameter initialization
- Regularization
- Activation function
Learning rate
As Andrew Ng has said, the most important hyperparameter to tune is the learning rate, so that is a reasonable place to start. When using stochastic gradient descent, it is important that the learning rate decreases over time, but not too quickly. Specifically, if $\epsilon_k$ is the learning rate used in iteration $k$, we want $\sum_{k=1}^\infty \epsilon_k = \infty$ but $\sum_{k=1}^\infty \epsilon_k^2 < \infty$ (GBC16 $\S$8.3.1).
GBC16 recommends using a learning rate $$ \epsilon_k = \begin{cases} (1 - k / \tau) \epsilon_0 + \frac{k}{\tau} \epsilon_\tau & k \leq \tau \\ \epsilon_\tau & k > \tau, \end{cases} $$ with the suggestion to take $\epsilon_\tau = 0.01 \epsilon_0$ and to set $\tau$ high enough that we pass through the training set a few hundred times before the learning rate becomes constant. Thus, we just need to select $\epsilon_0$. When $\epsilon_0$ is too high, the cost function may increase dramatically as the algorithm oscillates; if it is too low, the cost function will decrease too slowly. It should be possible to observe performance over the first few minibatches and adjust $\epsilon_0$ accordingly.
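As a sanity check, here is a minimal Python sketch of this schedule (the function and argument names are mine, not from GBC16):

```python
def gbc_learning_rate(k, eps0, tau, eps_tau=None):
    """Linear decay from eps0 to eps_tau over tau iterations, then hold constant."""
    if eps_tau is None:
        eps_tau = 0.01 * eps0          # GBC16's suggested final value
    if k <= tau:
        alpha = k / tau
        return (1 - alpha) * eps0 + alpha * eps_tau
    return eps_tau

# For example, starting at 0.1 and decaying over 10,000 iterations:
print(gbc_learning_rate(0, 0.1, 10_000))       # -> 0.1
print(gbc_learning_rate(10_000, 0.1, 10_000))  # -> 0.001
print(gbc_learning_rate(50_000, 0.1, 10_000))  # -> 0.001
```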
Andrew Ng mentions learning rate decay: $$ \epsilon_n = \frac{\epsilon_0}{1 + \delta \cdot n}, $$ where $\delta$ is a decay rate to be tuned and $n$ is the epoch number, i.e., how many passes we have made through the data. He also mentions exponential decay, $\epsilon_n = \delta^n \cdot \epsilon_0$ with $\delta < 1$, and square root decay, $\epsilon_n = \epsilon_0 / \sqrt{n}$.
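These decay schedules are just as easy to sketch; again the function names are my own, and $n$ should start at 1 for the square root decay to avoid dividing by zero:

```python
import math

def inverse_decay(n, eps0, delta):
    """Learning rate decay: eps0 / (1 + delta * n), where n is the epoch number."""
    return eps0 / (1 + delta * n)

def exponential_decay(n, eps0, delta):
    """Exponential decay: delta**n * eps0, with delta < 1."""
    return delta ** n * eps0

def sqrt_decay(n, eps0):
    """Square root decay: eps0 / sqrt(n), for n >= 1."""
    return eps0 / math.sqrt(n)
```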
Optimization algorithm
Simple stochastic gradient descent can often be improved by using momentum, and many of the algorithms that build on momentum can be thought of as adapting the learning rate. Perhaps a good starting point is to use Adam, but other good options include Stochastic Gradient Descent and RMSProp, both with and without momentum. Andrew Ng places the momentum parameter among the second tier of hyperparameters to examine, right after the learning rate.
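As a rough illustration, here is a minimal NumPy sketch of a single SGD-with-momentum update; `beta` is the momentum parameter in question (0.9 is a common default). In practice one would rely on a framework's built-in optimizers rather than hand-rolling this:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """One parameter update using an exponentially weighted average of past gradients."""
    velocity = beta * velocity + (1 - beta) * grad  # momentum: smooth the gradient estimate
    w = w - lr * velocity                           # step in the smoothed direction
    return w, velocity
```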
The size of the minibatch also influences performance, especially training time. Typically the minibatch size should be a power of 2, and it should be as large as possible while still fitting in the cache. If we plot the training time for a single minibatch as a function of the batch size, we should see a slow increase followed by a cliff-like jump at the point where the minibatch no longer fits in the cache. A reasonable starting point is the largest size that stays below that cliff.
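To look for that cliff empirically, one could time a stand-in for the per-minibatch work at power-of-two batch sizes; the matrix multiply below is only a crude proxy for a real forward and backward pass:

```python
import time
import numpy as np

n_features, n_hidden = 1024, 1024
W = np.random.randn(n_features, n_hidden)

for batch_size in [2 ** p for p in range(5, 14)]:   # 32, 64, ..., 8192
    X = np.random.randn(batch_size, n_features)
    start = time.perf_counter()
    _ = X @ W                                        # proxy for one minibatch of work
    elapsed = time.perf_counter() - start
    print(f"batch size {batch_size:5d}: {elapsed * 1e3:8.2f} ms")
```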
Batch normalization can also improve performance. Just as it is wise to normalize the features to have zero mean and unit variance, batch normalization, applied either to the activation values or to the inputs to the activation functions (Andrew Ng recommends the latter), has the effect of decoupling early layers from later ones. Given the $z$ values of some node in an intermediate layer, compute $$ \begin{align} \mu &= \frac{1}{m} \sum_i z^{(i)} \\ \sigma^2 &= \frac{1}{m} \sum_i (z^{(i)} - \mu)^2 \\ z_\textrm{norm}^{(i)} &= \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}} \\ \tilde{z}^{(i)} &= \gamma z_\textrm{norm}^{(i)} + \beta, \end{align} $$ where the sum is over the training examples in a particular minibatch. This process is repeated for each node in a layer, for whichever layers we care to apply it to. Here $\epsilon$ is there only to prevent division by zero, while $\gamma$ and $\beta$ are learnable parameters governing the mean and variance of the node “feature”. The $\tilde{z}^{(i)}$ values are then fed into the activation functions instead of $z^{(i)}$. Notably, using both a bias term and $\beta$ is redundant, so when using batch normalization, omit the bias term. There is also a nuance to applying this at test time: we need to use values of $\mu$ and $\sigma^2$ computed as exponentially weighted running averages over the minibatches.
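Here is a minimal NumPy sketch of the forward computation above, applied to a whole layer at once; `Z` holds one row per training example in the minibatch and one column per node, and the returned `mu` and `var` would feed the running averages needed at test time:

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize each node over the minibatch, then apply the learnable rescale and shift."""
    mu = Z.mean(axis=0)                       # per-node mean over the minibatch
    var = Z.var(axis=0)                       # per-node variance over the minibatch
    Z_norm = (Z - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    Z_tilde = gamma * Z_norm + beta           # learnable mean and variance
    return Z_tilde, mu, var
```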
Architecture
Especially if the training error is high, we should consider expanding the capacity of the network either with more layers or more hidden units per layer. Andrew Ng mentions that with modern hardware, it is best to use as large a network as is feasible and use regularization to prevent overfitting, rather than defaulting to a smaller network.
Often, the domain dictates certain architectures such as Convolutional Neural Networks and Recurrent Neural Networks. These constraints could be thought of as a type of regularization as well, enforcing parameter sharing.
Parameter initialization
Weight parameters should be initialized randomly to break symmetry. Bias parameters can typically be set to zero, but especially for the output layer, it can be helpful to set them to match the output statistics as if the weights were zero. For example, for binary classification, if 75% of the observations have label 1, then the bias, $b$, should be set so that $\sigma(b) = 0.75$.
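In that example, solving $\sigma(b) = 0.75$ for the bias gives $$ b = \sigma^{-1}(0.75) = \log \frac{0.75}{1 - 0.75} = \log 3 \approx 1.1. $$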
Andrew Ng uses a zero-mean Gaussian distribution for initializing the weights, with variance $2 / m$, where $m$ is the number of inputs to the layer, or sometimes just variance $0.1$, which matches $2 / m$ for a layer with about 20 inputs. GBC16 generates the initial parameters from a uniform distribution with endpoints $\pm m^{-1/2}$ or $\pm \sqrt{6 / (m + n)}$, where $m$ and $n$ are the number of inputs and outputs of the layer.
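A small NumPy sketch of these initializations for a fully connected layer with $m$ inputs and $n$ outputs (the function names are mine, not from either source):

```python
import numpy as np

def init_gaussian(m, n):
    """Zero-mean Gaussian with variance 2/m, where m is the number of inputs to the layer."""
    return np.random.randn(n, m) * np.sqrt(2.0 / m)

def init_uniform(m, n):
    """Uniform on [-1/sqrt(m), 1/sqrt(m)]."""
    limit = 1.0 / np.sqrt(m)
    return np.random.uniform(-limit, limit, size=(n, m))

def init_normalized_uniform(m, n):
    """Uniform on [-sqrt(6/(m+n)), sqrt(6/(m+n))], the other endpoints GBC16 mentions."""
    limit = np.sqrt(6.0 / (m + n))
    return np.random.uniform(-limit, limit, size=(n, m))
```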
The initialization can be treated as a hyperparameter to be tuned, but Andrew Ng mentioned it is fairly low on his list of things to examine.
Regularization
GBC16 defines regularization as any modification intended to reduce the test (generalization) error, even at the cost of increased training error. Weight decay, also known as $L_2$ regularization, encourages the weights to be small, while $L_1$ regularization encourages sparsity in the weights.
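As a sketch of how these two penalties enter the cost and its gradient, with `lam` as the regularization strength (another hyperparameter to tune):

```python
import numpy as np

def l2_penalty(W, lam):
    """Weight decay: adds lam/2 * ||W||^2 to the cost; gradient contribution is lam * W."""
    return 0.5 * lam * np.sum(W ** 2), lam * W

def l1_penalty(W, lam):
    """L1 regularization: adds lam * sum|w| to the cost; subgradient is lam * sign(W)."""
    return lam * np.sum(np.abs(W)), lam * np.sign(W)
```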
Augmenting the dataset, perhaps with synthetic data, can also be thought of as a type of regularization. It is especially applicable to computer vision. Options include translation, scaling, rotation, mirroring, skewing, blurring, and changing the contrast.
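Two of the simpler options are easy to sketch in NumPy for an image stored as an $(H, W, C)$ array with values in $[0, 1]$; a real pipeline would lean on a library's augmentation utilities instead:

```python
import numpy as np

def mirror(image):
    """Flip the image left to right."""
    return image[:, ::-1, :]

def adjust_contrast(image, factor):
    """Scale pixel values about their mean; factor > 1 increases contrast."""
    mean = image.mean()
    return np.clip(mean + factor * (image - mean), 0.0, 1.0)
```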
Adversarial training is another option: search for a synthetic example that is very similar to a real one in the dataset but for which the network reports a different label. Since such a small change should not alter the true label, these are cases where the network gets it wrong. The dataset can then be augmented with these problematic examples and the network retrained.
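A sketch of the fast-gradient-sign flavor of this idea, assuming a hypothetical `grad_loss_wrt_input(x, y)` supplied by whatever framework is in use; the perturbed example is nearly indistinguishable from `x` but is chosen to increase the loss as much as possible:

```python
import numpy as np

def adversarial_example(x, y, grad_loss_wrt_input, eps=0.01):
    """Nudge x in the direction that most increases the loss for its true label y."""
    gradient = grad_loss_wrt_input(x, y)   # hypothetical hook into the framework's autodiff
    return x + eps * np.sign(gradient)     # small worst-case perturbation
```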
GBC16 speaks highly of early stopping, comparing it with weight decay, but Andrew Ng discourages its use, claiming that it entangles the optimization process with hyperparameter tuning. For this reason, it might not be the first thing to try.
GBC16 and Andrew Ng both speak highly of dropout, wherein certain nodes are randomly dropped from the network during training, but emphasize that applying it properly in training and prediction requires careful attention to the details (such as remembering to scale the activation values by the inverse of the keep probability during training to ensure the expected activation magnitude remains the same). It seems to me that dropout can improve the robustness of the network, since each node is forced to be more reliable in isolation.
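Here is a minimal NumPy sketch of inverted dropout applied to one layer's activations during training; at prediction time dropout is simply switched off, and the division by `keep_prob` below is what keeps the expected activation magnitude the same in both settings:

```python
import numpy as np

def dropout_forward(A, keep_prob=0.8):
    """Randomly zero each activation with probability 1 - keep_prob, then rescale."""
    mask = np.random.rand(*A.shape) < keep_prob   # which nodes survive this pass
    return (A * mask) / keep_prob                 # inverted dropout: preserve E[A]
```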
Another option that seems like it would improve the robustness of the network is multitask learning. Instead of optimizing for performance on a single cost function, optimize over several cost functions capturing different real-world performance aspects. This seems like it would be helpful to avoid overfitting to a particular cost function that we may not strictly care about anyway. A network that performs well on a variety of important tasks is preferable to a network that has only been trained to work well on one single task.
Activation function
Functions that “saturate at both ends,” like the sigmoid and tanh, are poor defaults for hidden layers because their gradients are near zero whenever the unit saturates, which slows learning. Instead, ReLU is the preferred default option, but there are variations like the Leaky ReLU worth considering. The Leaky ReLU is given by:
$$ g(z) = \max\{z, \alpha z\}, $$
where $\alpha$ is typically small, like 0.01; when $\alpha = 0$, we get the regular ReLU. The intuition is that for the regular ReLU, any instances where $z < 0$ do not contribute to the gradient and so do not contribute to learning, whereas when $\alpha > 0$, all nodes contribute to learning. The value of $\alpha$ can itself be tuned. Note that the Leaky ReLU is a piecewise linear function with two sections; GBC16 mentions the possibility of letting the number of sections be a hyperparameter, allowing a piecewise linear (but convex) activation function that is itself learned. These are called maxout units.
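A small NumPy sketch of the Leaky ReLU and, for contrast, a single maxout unit that takes the max over $k$ learned affine pieces:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """max(z, alpha * z); alpha = 0 recovers the ordinary ReLU."""
    return np.maximum(z, alpha * z)

def maxout_unit(x, W, b):
    """Maxout: the max of k affine functions of x; W has shape (k, n_inputs), b has shape (k,)."""
    return np.max(W @ x + b)
```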
[^1]: I am referring to interior point methods as described in Chapter 11 of Boyd and Vandenberghe's Convex Optimization textbook.