Lecture 7: Convolutional Neural Networks

HTML Slides | PDF Slides

Convolution and Convolutional Neural Networks

COMP 4630 | Winter 2026 Charlotte Curtis

Overview

Convolutional neural networks (CNNs) are a type of neural network that is particularly well-suited to image data
Before we can understand CNNs, we need to understand convolution
References and suggested reading:
- Scikit-learn book: Chapter 14
- Deep Learning Book: Chapter 9
- 3blue1brown video: What is convolution?

Convolution

Convolution is defined as: $(f * g) (t) = \int_{- \infty}^{\infty} f (τ) g (t - τ) d τ$
Or in the discrete case: $(f * g) [n] = m = - \infty \sum \infty f [m] g [n - m]$
Can be though of as “flipping” one function and sliding it over the other, multiplying and summing at each point

Example

We’re in a hospital dealing with an outbreak. For the first 5 days we have 1 patient on Monday, 2 on Tuesday, etc:

$patients (x) = [1, 2, 3, 4, 5]$

Fortunately, we know how to treat them: 3 doses on day 1, then 2, then 1:

$doses (x) = [3, 2, 1]$

And after 3 days they’re cured.

How many doses do we need on each day?

Convolution in 2D

Extending to 2D basically adds another summation/integration: $(f * g) [n, m] = (f * g) (x, y) = i = 0 \sum n j = 0 \sum m f [i, j] g [n - i, m - j] \int_{- \infty}^{\infty} \int_{- \infty}^{\infty} f (u, v) g (x - u, y - v) d u d v$
This can also be extended to higher dimensions
Caution: a colour image is a 3D array, not a 2D array
For typical image processing applications, the colour channels are convolved independently such that the output is still a 3D array

Convolution kernels

Typically there is a small kernel that is convolved with the input
This is just the smaller of the two functions in the convolution
❓ What happens at the edges of the input?

center

Some common kernels

Averaging: $\frac{1}{9} 111111111$
Differentiation: $- 1 - 1 - 1 000111$

Sizes are commonly chosen to be 3x3, 5x5, 7x7, etc.
❓ Why divide by 9?
❓ Why odd sizes?
❓ What effect do you think these kernels will have on an image?

A side tangent on frequency representation

Any signal can be represented as a weighted summation of sinusoids
For a discrete signal $x [n]$ , you can think of this as: $x [n] = k = 0 \sum N - 1 [a_{k} cos (\frac{2 πkn}{N}) + b_{k} sin (\frac{2 πkn}{N})]$
Or, using Euler’s formula $e^{j θ} = cos θ + j sin θ$ : $x [n] = k = 0 \sum N - 1 c_{k} e^{j \frac{2 πkn}{N}}$ where the complex coefficients $c_{k} = a_{k} + j b_{k}$

Fourier Transform

To figure out what the coefficients $c_{k}$ are, we can use the Discrete Fourier Transform (DFT): $X [k] = n = 0 \sum N - 1 x [n] e^{- j \frac{2 πkn}{N}}$ where each element of $X [k]$ is the coefficient $c_{k}$ for frequency $k$
The Fast Fourier Transform (FFT) computes the DFT in $O (n lo g n)$ time
Convolution is $O (n^{2})$

Convolution is multiplication in frequency

Sometimes it is useful to use the Fourier Transform to obtain a frequency representation of an image (or signal)
In the frequency domain, convolution is simply element-wise multiplication
This allows for some efficient operations such as blurring/sharpening an image, as well as some fancy stuff like deconvolution
Sharp edges in space become ringing in frequency, and vice versa
❓ why do CNNs operate in the spatial domain?

Convolutional neural networks

bg right:30% fit

1958: Hubel and Wiesel experiment on cats and reveal the structure of the visual cortex
Determine that specific neurons react to specific features and receptive fields
Modelled in the “neocognitron” by Kunihiko Fukushima in 1980
LeCun’s work in the 1990s led to modern CNNs

Why CNNs?

A fully connected network has a 1:1 mapping of weights to inputs
Fine for MNIST (28x28) pixels, but quickly grows out of control
❓ If you train a (fully connected) network on 100x100 images, how would you infer on 200x200 images?
❓ What if an object is shifted, rotated, or flipped within the image?

Convolutional layers

A convolution layer is a set of kernels whose weights are learned
Instead of the straight up weighted sum of inputs, the input image is convolved with the learned kernel(s)
The output is often referred to as a feature map
The dimensionality of the feature map is determined by the:
- Size of the input image
- Number of kernels
- Padding (usually “same” or “valid”)
- “Stride”, or shift of the kernel at each step

Dimensionality examples

Input	Kernel	Stride	Padding	Output
$100 \times 100 \times 3$	$5 \times 5 \times 32$	1	same	$100 \times 100 \times 32$
$100 \times 100 \times 1$	$5 \times 5 \times 32$	2	same	$50 \times 50 \times 32$
$100 \times 100 \times 3$	$5 \times 5 \times 32$	1	valid	???

The number of channels has no impact on the depth of the output: the number of kernels determines the depth of the output
The colour channels are convolved independently, then summed

Number of parameters example

Input: $100 \times 100 \times 3$
Kernel: $5 \times 5 \times 32$
Bias terms: 32
Total parameters: $5 \times 5 \times 3 \times 32 + 32 = 2430$

While convolution only happens in 2D, the kernel can be thought of as a 3D volume - there’s a separate trainable kernel for each channel

Pooling layers

Pooling layers are used to reduce the dimensionality of the feature maps (aka downsampling) by taking the maximum or average of a region

center h:300

❓ Why would we want to downsample?

Putting it all together

Backpropagating CNNs

After backpropagating through the fully connected head, we need to backpropagate through the:
- Pooling layer
- Convolution layer
The pooling layer has no learnable parameters, but it needs to remember which input was maximum (all the rest get a gradient of 0)
The convolution layer is trickier math, but it ends up needing another convolution - this time with the kernel transposed

LeNet (1998) vs AlexNet (2012)

h:500 center

Inception v1 aka GoogLeNet (2014)

height:500 center

height:300

Struggled with vanishing gradients
v2 introduced batch normalization

VGG (2014)

center h:400

Simple architecture, but “very” deep (16 or 19 layers)
Fixed the convolution hyperparameters and focused on depth

ResNet (2015)

center

Key innovation: easier to learn “identity” functions ( $f (x) = x$ )
If a layer outputs 0, it doesn’t kill the gradient
Even deeper, e.g. ResNet-152

Transfer Learning

“If I have seen further, it is by standing on the shoulder of giants” – Isaac Newton

Transfer learning copy pastes a trained network into a new task
You can select which layers to keep, which to freeze, and which to re-train
You can also drop new layers on top of the old ones
Most of the time you want to freeze the early layers and add a new “head”

Data Augmentation

Garbage in, garbage out
We can artificially increase diversity with data augmentation:
- Random crops, flips, rotations
- Rescaling/resizing
- Changing colours
AutoAugment does a bunch of this automatically

Keyboard shortcuts

COMP 4630 | Winter 2026