Lecture 4: Backpropagation

Backpropagation

COMP 4630 | Winter 2026 Charlotte Curtis

Overview

A brief review of the history of neural networks
Neurons, perceptrons, and multilayer perceptrons
Backpropagation
References and suggested reading:
- Scikit-learn book: Chapter 10, introduction to artificial neural networks
- Deep Learning Book: Chapter 6, deep feedforward networks

The rise and fall of neural networks

In between each era of excitement and advancement there was an “AI winter”

Model of a neuron

McCulloch and Pitts (1943)
Neuron as a logic gate with time delay
“Activates” when the sum of inputs exceeds a threshold
Non-invertible (forward propagation only)

Threshold Logic Units (TLUs)

Linear I/O instead of binary
Rosenblatt (1957) combined multiple TLUs in a single layer
Physical machine: the Mark I Perceptron, designed for image recognition
Criticized by Minsky and Papert (1969) for its inability to solve the XOR problem - first AI winter

A single threshold logic unit (TLU)

Image source: Scikit-learn book

Training a perceptron

Hebb’s rule: “neurons that fire together, wire together”

$w_{ij}^{(\text{updated})} = w_{ij} + \eta (y_{j} - \hat{y}_{j}) x_{i}$

where $i$ = input, $j$ = output
Fed one instance at a time
Guaranteed to converge if inputs are linearly separable
Simple example: AND gate
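
The learning rule above can be sketched for the AND gate as follows. This is a minimal illustration rather than the lecture's own code: the step activation with its threshold at zero, the bias treated as a weight on a constant input of 1, the zero initialization, the learning rate `eta = 1.0`, and the function name `train_perceptron` are all assumptions made for the example.

```python
def step(z):
    """Step activation: output 1 when the weighted sum reaches the threshold (0)."""
    return 1 if z >= 0 else 0


def train_perceptron(samples, eta=1.0, max_epochs=100):
    """Apply w_i <- w_i + eta * (y - y_hat) * x_i, one instance at a time."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, y in samples:
            y_hat = step(sum(w * xi for w, xi in zip(weights, x)) + bias)
            delta = y - y_hat
            if delta != 0:
                errors += 1
                weights = [w + eta * delta * xi for w, xi in zip(weights, x)]
                bias += eta * delta  # bias behaves like a weight on a constant 1
        if errors == 0:  # a full pass with no mistakes: converged
            break
    return weights, bias


# Truth table for AND: ((input A, input B), target)
AND_GATE = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_GATE)
predictions = [
    step(sum(w * xi for w, xi in zip(weights, x)) + bias) for x, _ in AND_GATE
]
print(predictions)
```

AND is linearly separable, so the loop stops after a handful of passes. Swapping in the XOR truth table would exhaust `max_epochs` without ever reaching zero errors.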

A perceptron with two inputs and three outputs

Image source: Scikit-learn book

Multilayer perceptrons (MLPs)

If a perceptron can’t even solve XOR, how can it do higher order logic?
Consider that XOR can be rewritten as:
```
A xor B = (A and !B) or (!A and B)
```
A perceptron can solve AND, OR, and NOT… so what if the input to the OR perceptron is the output of two AND perceptrons?
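
The composition described above can be sketched with three threshold units. The weights and biases below are hand-picked for illustration and are not necessarily the ones shown on the slide; `tlu` and `xor` are hypothetical helper names.

```python
def step(z):
    """Step activation with the threshold at zero."""
    return 1 if z >= 0 else 0


def tlu(inputs, weights, bias):
    """A single threshold unit: weighted sum plus bias, then step."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def xor(a, b):
    # Hidden layer: two AND-style units, each with one negated input
    a_and_not_b = tlu((a, b), (1, -1), -1)   # fires only for (1, 0)
    not_a_and_b = tlu((a, b), (-1, 1), -1)   # fires only for (0, 1)
    # Output layer: OR of the two hidden units
    return tlu((a_and_not_b, not_a_and_b), (1, 1), -1)


truth_table = {(a, b): xor(a, b) for a in (0, 1) for b in (0, 1)}
print(truth_table)
```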

A solution to XOR

Backpropagation

I just gave you the weights to solve XOR, but how do we actually find them?
Applying the perceptron learning rule no longer works: we need to know how much to adjust each weight relative to the overall output error
Solution presented in 1986 by Rumelhart, Hinton, and Williams
Key insight: Good old chain rule! Plus some recursive efficiencies

Training MLPs with backpropagation

1. Initialize the weights, through some random-ish strategy
2. Perform a forward pass to compute the output of each neuron
3. Compute the loss of the output layer (e.g. MSE)
4. Calculate the gradient of the loss with respect to each weight
5. Update the weights using gradient descent (minibatch, stochastic, etc.)
6. Repeat steps 2-5 until stopping criteria met

Step 4 is the “backpropagation” part
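
The six steps can be sketched as a short NumPy loop for a two-layer network with linear activations and no biases. This is an illustration under stated assumptions, not a reference implementation: the single training sample and the fixed starting weights (standing in for random initialization so the run is reproducible) are the toy values from the worked example in this lecture, while the learning rate of 0.01 and the fixed budget of 50 iterations as the stopping criterion are arbitrary choices.

```python
import numpy as np

# Step 1: starting weights (fixed here instead of random, for reproducibility)
X = np.array([[2.0, 3.0]])                      # one sample, two features
y = np.array([[1.0]])                           # target
W1 = np.array([[-0.78, 0.13], [0.85, 0.23]])    # input -> hidden
W2 = np.array([[1.8], [0.40]])                  # hidden -> output
learning_rate = 0.01                            # assumed value

losses = []
for iteration in range(50):                     # Step 6: repeat steps 2-5
    # Step 2: forward pass (linear activations)
    h = X @ W1
    y_hat = h @ W2

    # Step 3: loss, 1/2 * squared error
    error = y_hat - y
    losses.append(0.5 * float(np.sum(error ** 2)))

    # Step 4: backpropagation (chain rule, output layer first)
    grad_W2 = h.T @ error
    grad_W1 = X.T @ (error @ W2.T)

    # Step 5: gradient descent update
    W2 -= learning_rate * grad_W2
    W1 -= learning_rate * grad_W1

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.2e}")
```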

Example: forward pass

With a linear activation function: $\hat{y} = X W^{(1)} W^{(2)}$
In summation notation for a single sample: $\hat{y} = \sum_{j=1}^{2} w_{j}^{(2)} \sum_{i=1}^{2} x_{i} w_{ij}^{(1)}$
In this case, $\hat{y} = 2.162$

$X = \begin{bmatrix} 2 & 3 \end{bmatrix}, y = 1$

and

$W^{(1)} = \begin{bmatrix} -0.78 & 0.13 \\ 0.85 & 0.23 \end{bmatrix}, W^{(2)} = \begin{bmatrix} 1.8 \\ 0.40 \end{bmatrix}$
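
The forward pass can be checked numerically with a few lines of NumPy. The sketch assumes $W^{(1)}$ is laid out with rows indexed by input $i$ and columns by hidden unit $j$, which is the arrangement that reproduces the value of $\hat{y}$ quoted above.

```python
import numpy as np

X = np.array([[2.0, 3.0]])                      # shape (1, 2): one sample
W1 = np.array([[-0.78, 0.13], [0.85, 0.23]])    # shape (2, 2): W1[i, j] = w_ij
W2 = np.array([[1.8], [0.40]])                  # shape (2, 1)

h = X @ W1                 # hidden layer output, approximately [[0.99, 0.95]]
y_hat = (h @ W2).item()    # scalar network output

print(round(y_hat, 3))     # 2.162
```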

Example: calculate error and gradient

We never picked a loss function! Let’s assume we’re using MSE
For a single sample: $L (w^{(1)}, w^{(2)}) = \frac{1}{2} (\hat{y} - y)^{2} = \frac{1}{2} \left( \sum_{j=1}^{2} w_{j}^{(2)} \sum_{i=1}^{2} x_{i} w_{ij}^{(1)} - y \right)^{2}$ with the $1/2$ added for convenience
The goal is to update each weight by a small amount to minimize the loss
Fortunately, we know how to find a small change in a function with respect to one of the variables: the partial derivative!

Recursively applying the chain rule

Weights in the second layer (connecting hidden and output): $\frac{\partial L}{\partial w_{j}^{(2)}} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w_{j}^{(2)}} = (\hat{y} - y) \frac{\partial \hat{y}}{\partial w_{j}^{(2)}} = (\hat{y} - y) \sum_{i} x_{i} w_{ij}^{(1)}$
For the first layer (connecting inputs to hidden): $\frac{\partial L}{\partial w_{ij}^{(1)}} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial h_{j}} \frac{\partial h_{j}}{\partial w_{ij}^{(1)}} = (\hat{y} - y) w_{j}^{(2)} x_{i}$ where $h_{j} = \sum_{i} x_{i} w_{ij}^{(1)}$ is the output of the hidden layer
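
One way to sanity-check these two formulas is to compare them against finite differences on the toy example. The sketch below assumes the toy values $X = [2, 3]$, $y = 1$ and the weights from the forward-pass example; the helper `loss` and the step size `eps` are choices made for illustration.

```python
import numpy as np

x = np.array([2.0, 3.0])
y = 1.0
W1 = np.array([[-0.78, 0.13], [0.85, 0.23]])   # W1[i, j] = w_ij^(1)
w2 = np.array([1.8, 0.40])                     # w2[j] = w_j^(2)


def loss(W1, w2):
    """Single-sample loss 1/2 * (y_hat - y)^2 with linear activations."""
    y_hat = (x @ W1) @ w2
    return 0.5 * (y_hat - y) ** 2


# Analytic gradients from the chain rule
h = x @ W1                                  # h_j = sum_i x_i w_ij
y_hat = h @ w2
grad_w2 = (y_hat - y) * h                   # dL/dw_j^(2)  = (y_hat - y) h_j
grad_W1 = (y_hat - y) * np.outer(x, w2)     # dL/dw_ij^(1) = (y_hat - y) w_j x_i

# Numerical gradient for W1 using central differences
eps = 1e-6
numeric_W1 = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        bump = np.zeros_like(W1)
        bump[i, j] = eps
        numeric_W1[i, j] = (loss(W1 + bump, w2) - loss(W1 - bump, w2)) / (2 * eps)

print(grad_W1)
print(numeric_W1)
```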

Bias terms

The toy example did not include bias terms, but these are very important (as seen in the perceptron examples)
With a single layer we can add a column of 1s to $X$, but with multiple layers we need to add bias at every layer
The forward pass becomes: $\hat{y} = (X W^{(1)} + b^{(1)}) W^{(2)} + b^{(2)}$
Or in summation form: $\hat{y} = \sum_{j=1}^{2} w_{j}^{(2)} \left( \sum_{i=1}^{2} x_{i} w_{ij}^{(1)} + b_{j}^{(1)} \right) + b^{(2)}$

Gradient with respect to the bias terms

For layer 2 (the output layer):

$\frac{\partial L}{\partial b^{(2)}} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial b^{(2)}} = (\hat{y} - y) (1)$

For layer 1: $\frac{\partial L}{\partial b_{j}^{(1)}} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial h_{j}} \frac{\partial h_{j}}{\partial b_{j}^{(1)}} = (\hat{y} - y) w_{j}^{(2)}$

where $h_{j} = \sum_{i} x_{i} w_{ij}^{(1)} + b_{j}^{(1)}$ is the weighted input to hidden unit $j$

Summary in matrix form

Parameter	Gradient
Weights of layer 2	$\frac{\partial L}{\partial W^{(2)}} = (\hat{y} - y) (X W^{(1)} + b^{(1)})$
Bias of layer 2	$\frac{\partial L}{\partial b^{(2)}} = (\hat{y} - y)$
Weights of layer 1	$\frac{\partial L}{\partial W^{(1)}} = (\hat{y} - y) W^{(2)} X$
Bias of layer 1	$\frac{\partial L}{\partial b^{(1)}} = (\hat{y} - y) W^{(2)}$
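
When turning the table into array code, the factors need transposes so that each gradient has the same shape as its parameter. The sketch below shows one consistent arrangement, assuming row-vector samples (`X` of shape `(n_samples, n_inputs)`); the bias values are made up for illustration, since the toy example has none.

```python
import numpy as np

X = np.array([[2.0, 3.0]])                      # (1, 2)
y = np.array([[1.0]])                           # (1, 1)
W1 = np.array([[-0.78, 0.13], [0.85, 0.23]])    # (2, 2)
W2 = np.array([[1.8], [0.40]])                  # (2, 1)
b1 = np.array([[0.1, -0.2]])                    # (1, 2), assumed values
b2 = np.array([[0.05]])                         # (1, 1), assumed value

# Forward pass with biases and linear activations
h = X @ W1 + b1
y_hat = h @ W2 + b2
error = y_hat - y                               # (y_hat - y), shape (1, 1)

# Gradients, each matching the shape of its parameter
grad_W2 = h.T @ error                           # (2, 1)
grad_b2 = error.sum(axis=0, keepdims=True)      # (1, 1)
back_to_hidden = error @ W2.T                   # (1, 2): error sent back through W2
grad_W1 = X.T @ back_to_hidden                  # (2, 2)
grad_b1 = back_to_hidden.sum(axis=0, keepdims=True)  # (1, 2)

for name, grad in [("W2", grad_W2), ("b2", grad_b2), ("W1", grad_W1), ("b1", grad_b1)]:
    print(name, grad.shape)
```

Summing over `axis=0` is a no-op for one sample, but it keeps the bias gradients correct when `X` holds a minibatch.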

Computational considerations

Many of the terms computed in the forward pass are reused in the backward pass (such as the inputs to each layer)
Similarly, gradients computed in layer $l + 1$ are reused in layer $l$
Typically each intermediate value is stored, but modern networks are big

Model	Parameters
Our example	6
AlexNet (2012)	60 million
GPT-3 (2020)	175 billion

Choices in neural network design

Activation functions

The simple example used a linear activation function (identity)
To include other activation functions, the forward pass becomes: $\hat{y} = f_{2} (f_{1} (X W^{(1)} + b^{(1)}) W^{(2)} + b^{(2)})$
The gradient in the output layer becomes: $\frac{\partial L}{\partial W^{(2)}} = \frac{\partial L}{\partial f_{2}} \frac{\partial f_{2}}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial W^{(2)}}$ where $z^{(2)} = f_{1} (X W^{(1)} + b^{(1)}) W^{(2)} + b^{(2)}$, i.e. the weighted sum in the second layer before the activation function is applied
Problem! The step function in the original perceptron is not differentiable at the threshold, and its derivative is zero everywhere else

Activation functions

A common early choice was the sigmoid function: $\sigma (z) = \frac{1}{1 + e^{-z}}, \quad \frac{d \sigma}{d z} = \sigma (z) (1 - \sigma (z))$
A more computationally efficient choice common today is the “ReLU” (Rectified Linear Unit) function: $\text{ReLU} (z) = \max (0, z), \quad \frac{d\, \text{ReLU}}{d z} = \begin{cases} 0 & z < 0 \\ 1 & z > 0 \end{cases}$
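
Both functions and their derivatives are short to write in NumPy. In this sketch the function names are illustrative, and the ReLU derivative at exactly $z = 0$ (undefined mathematically) is set to 0, which is a common convention but an assumption here.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid, written in terms of its own output."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu(z):
    """Rectified Linear Unit: max(0, z), applied elementwise."""
    return np.maximum(0.0, z)


def relu_grad(z):
    """Derivative of ReLU: 0 for z < 0, 1 for z > 0 (0 chosen at z == 0)."""
    return (z > 0).astype(float)


z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), sigmoid_grad(z))
print(relu(z), relu_grad(z))
```

The sigmoid derivative peaks at 0.25 (at $z = 0$) and shrinks toward zero for large $|z|$, whereas the ReLU derivative is a constant 1 for any positive input.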

Activation functions in hidden layers

The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles. – Deep Learning Book, Section 6.3

Activation functions in hidden layers serve to introduce nonlinearity
Common for multiple hidden layers to use the same activation function
Sigmoid, ReLU, and tanh (hyperbolic tangent) are common choices
Also “leaky” ReLU, Parameterized ReLU, absolute value, etc
Can be considered a hyperparameter of the network

Loss functions

The choice of loss function is very important!
Depends on the task at hand, e.g.:
- Regression: MSE, MAE, etc
- Classification: Usually some kind of cross-entropy (log likelihood)
May or may not include regularization terms
Must be differentiable, just like the activation functions

Activation functions in the output layer

Activation functions in the output layer should be chosen based on the loss function (and thus the task)
- Regression: linear
- Binary classification: sigmoid
- Multiclass classification: softmax (generalization of sigmoid)
Again, must be differentiable
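
The sense in which softmax generalizes sigmoid can be seen in a small sketch: a two-class softmax over the scores $[z, 0]$ gives the same probability as $\sigma(z)$. Subtracting the maximum score before exponentiating is a standard numerical-stability trick added here; it is not part of the lecture's definition.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid for a single score."""
    return 1.0 / (1.0 + np.exp(-z))


def softmax(scores):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)  # avoids overflow
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)


probs = softmax(np.array([2.0, 1.0, 0.1]))   # three-class example
two_class = softmax(np.array([1.5, 0.0]))    # scores [z, 0] with z = 1.5

print(probs, probs.sum())
print(two_class[0], sigmoid(1.5))            # the two values agree
```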

A complete fully connected network

Next up: Classification loss functions and metrics
