Lecture 6: Modern Neural Networks
Intro to modern neural networks
COMP 4630 | Winter 2024 Charlotte Curtis
Overview
- More decisions when making a neural network
- Weight initialization
- Number of neurons and layers
- Optimization algorithms
- References and suggested reading:
- Scikit-learn book: Chapters 10-11
- Deep Learning Book: Chapter 8
- Understanding Deep Learning: Chapter 7
Revisiting Backpropagation
- For a network with $L$ layers, the gradients of the loss function $J$ with respect to the weights in the last layer are given by:
  $$\frac{\partial J}{\partial \mathbf{W}^{[L]}} = \frac{\partial J}{\partial \hat{\mathbf{y}}} \frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{W}^{[L]}}$$
  assuming that the output $\hat{\mathbf{y}}$ is a function of layer $L$'s input $\mathbf{a}^{[L-1]}$
- At layer $L-1$, the gradients are computed as:
  $$\frac{\partial J}{\partial \mathbf{W}^{[L-1]}} = \frac{\partial J}{\partial \hat{\mathbf{y}}} \frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{a}^{[L-1]}} \frac{\partial \mathbf{a}^{[L-1]}}{\partial \mathbf{W}^{[L-1]}}$$
- At layer $L-2$, this becomes:
  $$\frac{\partial J}{\partial \mathbf{W}^{[L-2]}} = \frac{\partial J}{\partial \hat{\mathbf{y}}} \frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{a}^{[L-1]}} \frac{\partial \mathbf{a}^{[L-1]}}{\partial \mathbf{a}^{[L-2]}} \frac{\partial \mathbf{a}^{[L-2]}}{\partial \mathbf{W}^{[L-2]}}$$
- And so on, until we reach the first layer
- We are recursively applying the chain rule and re-using the gradients computed at the previous layer
- This is great for computational efficiency, but it can also lead to vanishing or exploding gradients
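The gradient reuse can be sketched numerically with a two-layer scalar "network" (all names, values, and the squared-error loss are illustrative):

```python
# Minimal backprop sketch: a 2-layer scalar network, where each layer's
# gradient reuses the upstream gradient computed at the layer above.
def forward_backward(x, w1, w2):
    # Forward pass: a1 = w1*x, y_hat = w2*a1, loss J = 0.5*(y_hat - 1)^2
    a1 = w1 * x
    y_hat = w2 * a1
    J = 0.5 * (y_hat - 1.0) ** 2

    # Backward pass: chain rule applied recursively
    dJ_dyhat = y_hat - 1.0      # gradient at the output
    dJ_dw2 = dJ_dyhat * a1      # last layer's weight gradient
    dJ_da1 = dJ_dyhat * w2      # reused by everything below this layer
    dJ_dw1 = dJ_da1 * x         # first layer's weight gradient
    return J, dJ_dw1, dJ_dw2

J, g1, g2 = forward_backward(x=2.0, w1=0.5, w2=0.3)
```

Note that `dJ_da1` is computed once and reused; in a deep network this saved term is what makes backprop efficient, and also what multiplies layer after layer to vanish or explode.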
Vanishing and Exploding Gradients
- Vanishing/exploding gradients occur when the gradients become near zero or extremely large as they are propagated back through the network
- Particularly problematic for recurrent neural networks, where the same weights are multiplied by themselves repeatedly
- Also a problem for very deep networks, and part of the reason that deep learning was not popular until the 2010s
- ❓ What changed?
Consider the variance
- At the input layer, $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$, and $\mathbf{x}$ has some variance
- Assume $\mathbf{b}$ and the mean of $\mathbf{W}$ are initialized to 0
- 🧮 What is the variance of $\mathbf{z}$?
- What about after the activation function $\phi(\mathbf{z})$?
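The variance question can be checked numerically. A sketch with unit-variance inputs and naive unit-variance weights (the sizes and seed are illustrative), where the variance of the pre-activation grows like the fan-in:

```python
import random
import statistics

random.seed(0)  # illustrative seed for reproducibility

n_in = 100

def sample_z():
    """One pre-activation z = sum(w_i * x_i) for fresh random w and x."""
    x = [random.gauss(0.0, 1.0) for _ in range(n_in)]  # inputs, variance 1
    w = [random.gauss(0.0, 1.0) for _ in range(n_in)]  # naive unit-variance weights
    return sum(wi * xi for wi, xi in zip(w, x))

zs = [sample_z() for _ in range(2000)]
var_z = statistics.pvariance(zs)  # roughly n_in * Var(w) * Var(x) = 100
```

With unit-variance weights the output variance is about 100x the input variance after one layer, which is exactly the blow-up that careful initialization is designed to prevent.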
Initialization strategies
- In 2010, Glorot and Bengio proposed the Xavier initialization for a layer with $n_{\text{in}}$ inputs and $n_{\text{out}}$ outputs:
  $$W \sim \mathcal{N}\left(0,\ \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$$
- Goal is to preserve the variance of the input and output in both directions
- Similar to LeCun initialization, and apparently an overlooked feature of networks from the 1990s
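A minimal sketch of the normal form of Xavier/Glorot initialization (the layer sizes and seed are illustrative):

```python
import math
import random
import statistics

random.seed(0)  # illustrative seed for reproducibility

# Xavier/Glorot initialization, normal form: draw weights with
# variance 2 / (n_in + n_out) to preserve variance in both directions.
def xavier_normal(n_in, n_out, rng=random):
    std = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]

W = xavier_normal(256, 128)
var_w = statistics.pvariance([w for row in W for w in row])
# var_w should be close to 2 / (256 + 128) ≈ 0.0052
```

In practice you would use `torch.nn.init.xavier_normal_` rather than rolling your own; the sketch just makes the variance target explicit.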
Initialization for ReLU
- Glorot initialization was derived under the assumption of linear activation functions (even though they knew this wasn’t the case)
- In 2015, He et al. proposed the He initialization specifically for ReLU activations:
  $$W \sim \mathcal{N}\left(0,\ \frac{2}{n_{\text{in}}}\right)$$
- The choice of normal vs uniform is apparently not very important
- Default in PyTorch is Kaiming (He) uniform
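A sketch of the uniform variant of He initialization, which illustrates why the normal-vs-uniform choice matters little: both hit the same target variance (sizes and seed are illustrative):

```python
import math
import random
import statistics

random.seed(0)  # illustrative seed for reproducibility

# He initialization, uniform variant: U(-limit, limit) with
# limit = sqrt(6 / n_in) has variance limit^2 / 3 = 2 / n_in,
# matching the normal variant N(0, 2 / n_in).
def he_uniform(n_in, n_out, rng=random):
    limit = math.sqrt(6.0 / n_in)
    return [[rng.uniform(-limit, limit) for _ in range(n_out)] for _ in range(n_in)]

W = he_uniform(256, 128)
var_w = statistics.pvariance([w for row in W for w in row])
# var_w should be close to 2 / 256 = 0.0078125
```

The extra factor of 2 (relative to Xavier) compensates for ReLU zeroing out roughly half of the activations.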
Batch normalization
- Also in 2015, Ioffe and Szegedy proposed batch normalization as a way to mitigate vanishing/exploding gradients
- This is simply a normalization at each layer, shifting and scaling the inputs to have a mean of 0 and a variance of 1 (across the batch)
- A moving average of the mean and variance is maintained during training, and used for normalization during inference
- It also ends up acting as regularization, magic!
- ❓ Why wouldn’t you want to use batch normalization?
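The normalization step itself is simple. A minimal training-mode sketch for a single feature across one batch (the learned scale/shift `gamma`/`beta` and the batch values are illustrative):

```python
import math
import statistics

# Batch normalization for one feature, training mode: shift and scale
# the batch to mean 0, variance 1, then apply the learned gamma/beta.
# eps guards against division by zero for near-constant batches.
def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    mean = statistics.fmean(batch)
    var = statistics.pvariance(batch)  # population variance across the batch
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

out = batch_norm([2.0, 4.0, 6.0, 8.0])
```

At inference time the moving averages mentioned above replace the per-batch `mean` and `var`, since a single example (or a tiny batch) has no meaningful batch statistics.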
ReLU and its variants
- In early works, the sigmoid or tanh functions were popular
- Both have a small range of non-zero gradients
- ReLU has a stable gradient for positive inputs, but can lead to the dying ReLU problem whereby certain neurons are “turned off”
- ❓ How can we prevent dying ReLUs?
Note: this may not be a problem, and ReLU is cheap. Don’t optimize prematurely unless you’re seeing lots of “dead” neurons.
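One common fix is a leaky variant: keep a small negative-side slope so the gradient is never exactly zero and a "dead" neuron can recover. A sketch (the 0.01 slope is a common default, not the only choice):

```python
# ReLU zeroes negative inputs entirely; its gradient there is 0,
# so a neuron stuck in the negative regime stops learning ("dies").
def relu(z):
    return max(0.0, z)

# Leaky ReLU keeps a small slope for negative inputs, so the gradient
# is small but nonzero and the neuron can climb back out.
def leaky_relu(z, slope=0.01):
    return z if z > 0 else slope * z
```

For positive inputs the two are identical; they differ only where plain ReLU would go silent.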
Number of neurons and layers
- Number of neurons in the input layer is defined by number of features
- Number of neurons in the output layer is defined by prediction task
- In between is a design choice
- Common early choice was a pyramid shape, but it turns out that a stack of layers with the same number of neurons works well too
- Deeper networks can solve more complex problems with the same number of total parameters, but are also prone to vanishing/exploding gradients
Ultimately, yet another hyperparameter to be tuned
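The "same parameters, different depth" trade-off is easy to quantify. A sketch counting weights and biases for fully connected stacks (the layer sizes are illustrative):

```python
# Parameter count for a fully connected stack: each layer contributes
# n_in * n_out weights plus n_out biases.
def n_params(sizes):
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

wide = n_params([100, 300, 10])           # one wide hidden layer
deep = n_params([100, 150, 100, 50, 10])  # deeper, similar total parameters
```

Here the pyramid-shaped deep network and the single wide layer land within a few percent of each other in parameter count, yet the deep one can represent more complex functions (at the cost of harder gradient flow).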
Optimization algorithms: variations on gradient descent
- Gradient descent takes small regular steps, constant or otherwise
- Many variations exist! For example, momentum keeps track of the previously computed gradient and uses it to inform the new step:
  $$\mathbf{m} \leftarrow \beta \mathbf{m} - \eta \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}), \qquad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \mathbf{m}$$
  where $\beta$ is a hyperparameter between 0 and 1
- Adaptive moment estimation (Adam) is a popular choice that adds on an exponentially decaying average of the squared gradients
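A sketch of the momentum update on a 1-D quadratic $f(w) = w^2$ (the learning rate, momentum coefficient, and step count are illustrative):

```python
# Gradient descent with momentum on f(w) = w^2, whose gradient is 2w.
# m accumulates a decaying sum of past gradients, smoothing the steps.
def minimize(w, eta=0.1, beta=0.9, steps=100):
    m = 0.0
    for _ in range(steps):
        grad = 2.0 * w
        m = beta * m - eta * grad  # momentum term: decay old, add new
        w = w + m                  # step in the accumulated direction
    return w

w_final = minimize(5.0)  # approaches the minimum at w = 0
```

With `beta = 0` this reduces to plain gradient descent; larger `beta` lets the optimizer coast through small bumps but can overshoot.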
The Adam optimizer
- Keeps track of first ($\mathbf{m}$) and second ($\mathbf{v}$) moments of the gradient, with two exponential decay terms $\beta_1$ and $\beta_2$
- At each time step $t$, the update is now:
  $$\mathbf{m} \leftarrow \beta_1 \mathbf{m} + (1 - \beta_1) \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$$
  $$\mathbf{v} \leftarrow \beta_2 \mathbf{v} + (1 - \beta_2) \left(\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})\right)^2$$
  $$\hat{\mathbf{m}} = \frac{\mathbf{m}}{1 - \beta_1^t}, \qquad \hat{\mathbf{v}} = \frac{\mathbf{v}}{1 - \beta_2^t}$$
  $$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \frac{\hat{\mathbf{m}}}{\sqrt{\hat{\mathbf{v}}} + \epsilon}$$
- $\beta_1$ and $\beta_2$ are typically 0.9 and 0.999, respectively
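The same 1-D quadratic makes a convenient test case for Adam; this sketch uses the common default hyperparameters (the learning rate and step count are illustrative):

```python
import math

# Adam on f(w) = w^2, with bias-corrected first and second moments.
def adam(w, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = 2.0 * w                          # gradient of w^2
        m = beta1 * m + (1 - beta1) * g      # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g * g  # second moment (squared gradients)
        m_hat = m / (1 - beta1 ** t)         # bias correction: early m, v start
        v_hat = v / (1 - beta2 ** t)         # at 0 and would be underestimates
        w -= eta * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_final = adam(5.0)  # approaches the minimum at w = 0
```

Dividing by $\sqrt{\hat{\mathbf{v}}}$ normalizes the step size per parameter, which is why Adam often needs less learning-rate tuning than plain SGD.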
Regularization via dropout
- Dropout is a regularization technique that randomly sets a fraction of the neurons to zero during training
- During each training pass, a neuron has a probability $p$ of being dropped
- Similar to training an ensemble of models then bagging
- Helps to prevent overfitting, but can slow down training
- Typical values: 0.5 for hidden layers, 0.2 for input layers
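Dropout is commonly implemented in the "inverted" form, scaling the surviving activations so the expected value is unchanged between training and inference. A sketch (the activations, $p$, and seed are illustrative):

```python
import random

random.seed(0)  # illustrative seed for reproducibility

# Inverted dropout: drop each activation with probability p during
# training, and scale survivors by 1/(1-p) so the expected activation
# matches inference, where nothing is dropped.
def dropout(activations, p=0.5, training=True, rng=random):
    if not training:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() > p else 0.0 for a in activations]

out = dropout([1.0] * 10, p=0.5)  # each entry is either 0.0 or 2.0
```

At inference (`training=False`) the input passes through untouched, which is what `model.eval()` toggles in frameworks like PyTorch.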
Choices for starting
From the Deep Learning Book, section 11.2:
- “Depending on the complexity of your problem, you may even want to begin without using deep learning.”
- Tabular data: fully connected, images: convolutional, sequences: recurrent*
- ReLU or variants, with He initialization
- SGD or Adam, add batch normalization if unstable
- Use some kind of regularization, such as dropout
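A toy sketch tying a few of these defaults together: a fully connected network with He-initialized weights, zero biases, and ReLU hidden layers (pure Python for illustration; in practice you would build this with PyTorch's `nn.Linear`, whose default initialization is already Kaiming-style; all sizes here are illustrative):

```python
import math
import random

random.seed(0)  # illustrative seed for reproducibility

# One fully connected layer: He-initialized weights (variance 2/n_in),
# biases at zero, stored as (W, b) with W as n_out rows of n_in weights.
def init_layer(n_in, n_out):
    std = math.sqrt(2.0 / n_in)
    W = [[random.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

# Forward pass: ReLU on hidden layers, identity on the output layer.
def forward(x, layers):
    for i, (W, b) in enumerate(layers):
        z = [sum(w * xi for w, xi in zip(row, x)) + bi
             for row, bi in zip(W, b)]
        x = z if i == len(layers) - 1 else [max(0.0, v) for v in z]
    return x

# 4 input features -> two hidden layers of 16 -> 3 outputs
layers = [init_layer(4, 16), init_layer(16, 16), init_layer(16, 3)]
y = forward([0.5, -0.2, 0.1, 0.9], layers)
```

From here the checklist's remaining items (Adam or SGD, batch norm, dropout) slot in as additional per-layer operations during training.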