Lecture 6: Modern Neural Networks

Intro to modern neural networks

COMP 4630 | Winter 2024 Charlotte Curtis


Overview


Revisiting Backpropagation

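  • For a network with $L$ layers, the gradients of the loss function $J$ with respect to the weights in the last layer are given by:

    $$\frac{\partial J}{\partial W^{(L)}} = \frac{\partial J}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial W^{(L)}}$$

    assuming that the output $\hat{y}$ is a function of layer $L$’s input $z^{(L)}$.

  • At layer $L-1$, the gradients are computed as:

    $$\frac{\partial J}{\partial W^{(L-1)}} = \frac{\partial J}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial z^{(L-1)}} \frac{\partial z^{(L-1)}}{\partial W^{(L-1)}}$$

  • At layer $L-2$, this becomes:

    $$\frac{\partial J}{\partial W^{(L-2)}} = \frac{\partial J}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial z^{(L-1)}} \frac{\partial z^{(L-1)}}{\partial z^{(L-2)}} \frac{\partial z^{(L-2)}}{\partial W^{(L-2)}}$$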

  • And so on, until we reach the first layer.

  • We are recursively applying the chain rule and re-using the gradients computed at the previous layer

  • This is great for computational efficiency, but it can also lead to vanishing or exploding gradients
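
The recursive re-use of gradients can be sketched on a toy network. This is a minimal illustration, not the lecture's implementation: it assumes a chain of one-neuron sigmoid layers without biases and a squared-error loss, and the names `forward`, `backward`, and `upstream` are made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, x):
    """Chain of one-neuron sigmoid layers; returns the activations a[0..L]."""
    activations = [x]
    for w in weights:
        activations.append(sigmoid(w * activations[-1]))
    return activations

def backward(weights, activations, target):
    """Gradient of J = 0.5 * (a_L - target)^2 with respect to every weight."""
    grads = [0.0] * len(weights)
    upstream = activations[-1] - target  # dJ/da at the output layer
    for l in reversed(range(len(weights))):
        a_out, a_in = activations[l + 1], activations[l]
        delta = upstream * a_out * (1.0 - a_out)  # dJ/dz for this layer
        grads[l] = delta * a_in                   # dJ/dw for this layer
        upstream = delta * weights[l]             # re-used by the layer below
    return grads

weights = [0.5, -1.2, 0.8, 1.5]  # hypothetical values
acts = forward(weights, x=1.0)
grads = backward(weights, acts, target=1.0)
print(grads)
```

Each layer only multiplies the running `upstream` value by its own local derivatives, which is why the cost stays linear in depth, and also why many small (or large) factors compound.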


Vanishing and Exploding Gradients

  • Vanishing/exploding gradients occur when the gradients shrink toward zero or grow without bound as they are propagated back through the network
  • Particularly problematic for recurrent neural networks, where the same weights are multiplied by themselves repeatedly
  • Also a problem for very deep networks, and part of the reason that deep learning was not popular until the 2010s
  • ❓ What changed?
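
The effect of repeated multiplication can be seen with a deliberately simplified example. It assumes a scalar linear recurrence $h_t = w h_{t-1}$ (no activation function), so the gradient of $h_T$ with respect to $h_0$ is just $w^T$; `gradient_scale` is a hypothetical helper name.

```python
def gradient_scale(w, steps):
    """d h_T / d h_0 for the linear recurrence h_t = w * h_{t-1}."""
    scale = 1.0
    for _ in range(steps):
        scale *= w  # the same weight is applied at every step
    return scale

for w in (0.9, 1.0, 1.1):
    print(w, gradient_scale(w, steps=100))
```

A weight only slightly below 1 drives the gradient toward zero, and one only slightly above 1 makes it blow up.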

Consider the variance

  • At the input layer, $z = Wx + b$, and $x$ has some variance $\sigma_x^2$
  • Assume the bias $b$ is initialized to 0 and the weights $W$ are initialized with a mean of 0
  • 🧮 What is the variance of $z$?
  • What about after the activation function $f(z)$?
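
One way to explore the variance question numerically is a small simulation. This sketch makes assumptions that go beyond the slide: weights and inputs are independent zero-mean Gaussians, the bias is 0, and both are redrawn for every sample. Under those assumptions the pre-activation variance should be roughly $n \cdot \mathrm{Var}(w) \cdot \mathrm{Var}(x)$ for $n$ inputs.

```python
import random
import statistics

random.seed(0)

def sample_z(n_inputs, weight_std, input_std):
    """One draw of z = sum_i w_i * x_i with zero-mean Gaussian w and x, and b = 0."""
    return sum(
        random.gauss(0.0, weight_std) * random.gauss(0.0, input_std)
        for _ in range(n_inputs)
    )

n_inputs, weight_std, input_std = 100, 1.0, 1.0
samples = [sample_z(n_inputs, weight_std, input_std) for _ in range(2000)]

empirical = statistics.pvariance(samples)
predicted = n_inputs * weight_std**2 * input_std**2
print(f"empirical Var(z) = {empirical:.1f}, predicted = {predicted:.1f}")
```

With unit-variance weights the variance grows with the number of inputs, which is the motivation for scaling the initial weights by the layer size.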

Initialization strategies

  • In 2010, Glorot and Bengio proposed the Xavier (or Glorot) initialization for a layer with $n_{in}$ inputs and $n_{out}$ outputs: $W \sim \mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right)$
  • Goal is to preserve the variance of the input and output in both directions
  • Similar to LeCun initialization, and apparently an overlooked feature of networks from the 1990s
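
As a rough sketch, Glorot initialization can be sampled with the standard library alone. The uniform variant is used here, with limit $\sqrt{6 / (n_{in} + n_{out})}$, which has the same variance $2 / (n_{in} + n_{out})$ as the normal variant. The function name and layer sizes are illustrative assumptions.

```python
import math
import random
import statistics

random.seed(0)

def glorot_uniform(n_in, n_out):
    """n_out x n_in weights from U(-limit, limit), limit = sqrt(6 / (n_in + n_out))."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[random.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]

n_in, n_out = 300, 100  # hypothetical layer sizes
W = glorot_uniform(n_in, n_out)
flat = [w for row in W for w in row]

empirical_var = statistics.pvariance(flat)
target_var = 2.0 / (n_in + n_out)
print(empirical_var, target_var)
```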

Initialization for ReLU

  • Glorot initialization was derived under the assumption of linear activation functions (even though they knew this wasn’t the case)
  • In 2015, He et al. proposed the He initialization specifically for ReLU activations: $W \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)$
  • The choice of normal vs uniform is apparently not very important
  • Default in PyTorch (for nn.Linear) is $W \sim \mathcal{U}(-\sqrt{k}, \sqrt{k})$ with $k = 1 / n_{in}$
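
The difference between the two schemes under ReLU can be illustrated with a small experiment. It assumes a stack of equal-width fully connected ReLU layers with no biases and Gaussian weights, and it tracks the mean squared pre-activation at the last layer. The width, depth, and function names are arbitrary choices for the sketch.

```python
import math
import random

random.seed(0)

def relu(z):
    return max(0.0, z)

def mean_square_after(depth, width, weight_std):
    """Mean squared pre-activation at the last of `depth` ReLU layers of equal width."""
    a = [random.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Fresh zero-mean Gaussian weights for every connection
        z = [
            sum(random.gauss(0.0, weight_std) * a_j for a_j in a)
            for _ in range(width)
        ]
        a = [relu(z_i) for z_i in z]
    return sum(z_i * z_i for z_i in z) / width

width, depth = 200, 10
he_std = math.sqrt(2.0 / width)                # He: Var(W) = 2 / n_in
glorot_std = math.sqrt(2.0 / (width + width))  # Glorot: Var(W) = 2 / (n_in + n_out)

he_scale = mean_square_after(depth, width, he_std)
glorot_scale = mean_square_after(depth, width, glorot_std)
print(f"He: {he_scale:.3g}, Glorot: {glorot_scale:.3g}")
```

Because ReLU zeroes roughly half of the pre-activations, the Glorot scale loses about a factor of two in signal per layer in this setup, while the extra factor of 2 in the He variance compensates for it.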

Batch normalization

  • Also in 2015, Ioffe and Szegedy proposed batch normalization as a way to mitigate vanishing/exploding gradients
  • This is simply a normalization at each layer, shifting and scaling the inputs to have a mean of 0 and a variance of 1 (across the batch)
  • A moving average of the mean and variance is maintained during training, and used for normalization during inference
  • It also ends up acting as regularization, magic!
  • ❓ Why wouldn’t you want to use batch normalization?
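
The training and inference behaviour described above can be sketched for a single feature. This is a simplified illustration, not PyTorch's implementation: it assumes one scalar feature per example, includes the learnable scale `gamma` and shift `beta` from the original paper at their initial values, and uses an exponential moving average whose `momentum` value is an assumption.

```python
import math

class BatchNorm1Feature:
    """Batch normalization for a single feature (a list of scalars per batch)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0  # learnable scale and shift (not trained here)
        self.running_mean, self.running_var = 0.0, 1.0
        self.momentum, self.eps = momentum, eps

    def __call__(self, batch, training=True):
        if training:
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # Moving averages of the batch statistics, used later at inference
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            mean, var = self.running_mean, self.running_var
        scale = self.gamma / math.sqrt(var + self.eps)
        return [scale * (x - mean) + self.beta for x in batch]

bn = BatchNorm1Feature()
out = bn([2.0, 4.0, 6.0, 8.0], training=True)  # normalized with batch statistics
single = bn([0.5], training=False)             # normalized with running statistics
print(out, single)
```

Note that inference works on a single example because it no longer depends on the other members of the batch.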

ReLU and its variants

  • In early works, the sigmoid or tanh functions were popular
  • Both saturate: their gradients are close to zero outside a small range of inputs
  • ReLU has a stable gradient for positive inputs, but can lead to the dying ReLU problem whereby certain neurons are “turned off”
  • ❓ How can we prevent dying ReLUs?

Note: this may not be a problem, and ReLU is cheap. Don’t optimize prematurely unless you’re seeing lots of “dead” neurons.
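
One commonly used variant is the leaky ReLU, sketched below next to the plain ReLU. The slope `alpha=0.01` is an assumed default for illustration; other variants exist and may suit a given problem better.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Zero gradient for negative inputs: a neuron stuck here stops learning
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    # A small non-zero slope keeps some gradient flowing for negative inputs
    return 1.0 if z > 0 else alpha

for z in (-2.0, 3.0):
    print(z, relu_grad(z), leaky_relu_grad(z))
```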


Number of neurons and layers

  • Number of neurons in the input layer is defined by number of features
  • Number of neurons in the output layer is defined by prediction task
  • In between is a design choice
  • Common early choice was a pyramid shape, but it turns out that a stack of layers with the same number of neurons works well too
  • Deeper networks can solve more complex problems with the same number of total parameters, but are also prone to vanishing/exploding gradients

Ultimately, yet another hyperparameter to be tuned
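
When comparing architectures it helps to count parameters. The sketch below does this for fully connected networks; the two layer-size lists are hypothetical (784 inputs and 10 outputs, as for a small image classification task) and are only meant to contrast a pyramid with a uniform stack.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected network with the given layer widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

pyramid = [784, 300, 200, 100, 10]  # hypothetical pyramid-shaped network
uniform = [784, 200, 200, 200, 10]  # hypothetical stack of equal-width layers
print(count_parameters(pyramid), count_parameters(uniform))
```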


Optimization algorithms: variations on gradient descent

  • Gradient descent takes small steps along the negative gradient, with a step size (learning rate) that is either constant or follows a schedule
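  • Many variations exist! For example, momentum keeps track of the previously computed gradient and uses it to inform the new step:

    $$m_t = \beta m_{t-1} + \nabla_\theta J(\theta_{t-1}), \qquad \theta_t = \theta_{t-1} - \eta m_t$$

    where $\beta$ is a hyperparameter between 0 and 1 and $\eta$ is the learning rate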
  • Adaptive moment estimation (Adam) is a popular choice that adds on an exponentially decaying average of the squared gradients
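
A minimal sketch of gradient descent with momentum, using one common formulation (accumulate `m = beta * m + gradient`, then step by `-lr * m`). The toy loss $J(\theta) = \frac{1}{2}\theta^2$, the hyperparameter values, and the function names are assumptions for illustration.

```python
def grad(theta):
    """Gradient of the toy loss J(theta) = 0.5 * theta^2."""
    return theta

def momentum_descent(theta, lr=0.1, beta=0.9, steps=300):
    m = 0.0
    for _ in range(steps):
        m = beta * m + grad(theta)  # accumulate past gradients
        theta = theta - lr * m      # step along the accumulated direction
    return theta

final_theta = momentum_descent(theta=5.0)
print(final_theta)
```

Setting `beta=0` recovers plain gradient descent, which makes it easy to compare the two on the same problem.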

The Adam optimizer

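  • Keeps track of first ($m$) and second ($v$) moments of the gradient, with two exponential decay terms $\beta_1$ and $\beta_2$
  • At each time step $t$, with gradient $g_t = \nabla_\theta J(\theta_{t-1})$ and learning rate $\eta$, the update is now:

    $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

    $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

    $$\theta_t = \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

  • $\beta_1$ and $\beta_2$ are typically 0.9 and 0.999, respectively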
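
The Adam update can be sketched for a single scalar parameter. This follows the standard published form with bias correction; the toy loss $J(\theta) = \frac{1}{2}\theta^2$, the learning rate, and the small constant `eps` are assumptions for the example rather than values from the lecture.

```python
import math

def grad(theta):
    """Gradient of the toy loss J(theta) = 0.5 * theta^2."""
    return theta

def adam(theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g      # first moment: average gradient
        v = beta2 * v + (1 - beta2) * g * g  # second moment: average squared gradient
        m_hat = m / (1 - beta1 ** t)         # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

final_theta = adam(theta=5.0)
print(final_theta)
```

Dividing by the square root of the second moment means the very first step has a size of about `lr`, regardless of how large the gradient is.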

Regularization via dropout

  • Dropout is a regularization technique that randomly sets a fraction of the neurons to zero during training
  • During each training pass, each neuron is dropped independently with probability $p$
  • Similar to training an ensemble of models then bagging
  • Helps to prevent overfitting, but can slow down training
  • Typical values: 0.5 for hidden layers, 0.2 for input layers
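
A minimal sketch of dropout on a list of activations. It uses the "inverted" form, where surviving values are scaled by $1 / (1 - p)$ during training so that nothing needs to change at inference; that scaling convention is an assumption here, although it matches what common frameworks do. The sketch requires $p < 1$.

```python
import random

def dropout(activations, p, training=True):
    """Inverted dropout: zero each value with probability p and rescale the rest."""
    if not training or p == 0.0:
        return list(activations)  # inference: pass everything through unchanged
    keep = 1.0 - p
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(0)
hidden = [1.0] * 10_000
dropped = dropout(hidden, p=0.5)

fraction_dropped = sum(1 for a in dropped if a == 0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)
print(fraction_dropped, mean_activation)
```

The rescaling keeps the expected activation the same in training and inference, even though a different random subset of neurons is active on every pass.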

Choices for starting

From the Deep Learning Book, section 11.2:

  • “Depending on the complexity of your problem, you may even want to begin without using deep learning.”
  • Tabular data: fully connected, images: convolutional, sequences: recurrent*
  • ReLU or variants, with He initialization
  • SGD or Adam, add batch normalization if unstable
  • Use some kind of regularization, such as dropout
*This book was written before transformer models

Implementation time