
Welcome!

This site contains lecture notes, tutorials, and assignment instructions for COMP 4630: Machine Learning. Use the sidebar on the left to view the material, and if you notice a problem, please report an issue at the GitHub repo.

Sometimes I’ll write stuff on the board. You can find a version of my scribbles here, which I’ll try to keep more or less up to date.

Lecture 1: Data Exploration

HTML Slides | PDF Slides

Welcome to Machine Learning!

COMP 4630 | Winter 2026 Charlotte Curtis


What is this course about?

  • Continuing the supervised/unsupervised learning algorithms from COMP 3652, with a focus on Neural Networks
  • First half: the history, theory, and math behind neural networks
  • Second half: applications of NNs in computer vision, natural language processing, and more

This is not (just) a course on building models using libraries like TensorFlow or PyTorch, it is a course on understanding the theory


How did I get involved with ML?



What do you want to learn about ML?


Grade Assessment

| Component | Weight |
|---|---|
| Assignments | 3 × 10% |
| Midterm (theory) exam | 20% |
| Journal club | 10% |
| Final project | 40% |

Bonus marks may be awarded for substantial corrections to materials, submitted as pull requests

Course materials repo: https://github.com/mru-comp4630/w26 Rendered at: https://mru-comp4630.github.io/w26/


Textbooks and other readings

Primary Textbook:

More mathy details:

Journal club list: on D2L under “Course Info” (requires MRU library login)


Generative AI policy

  • Yes, AI can do a lot of what I’m asking for in this course
  • No, I do not want to read about what AI “thinks”
  • ❓ What do you think is an appropriate use?

Machine Learning Project Checklist

Appendix A of the hands-on textbook

  1. Frame the problem and look at the big picture.
  2. Get the data.
  3. Explore the data to gain insights.
  4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
  5. Explore many different models and short-list the best ones.
  6. Fine-tune your models and combine them into a great solution.
  7. Present your solution.
  8. Launch, monitor, and maintain your system.
Bold stuff is what we'll discuss in this topic

1. Look at the big picture

Example Dataset: California housing prices (1990)

❓ Discussion questions:

  • How does the company expect to use and benefit from this model?
  • What is the current solution?
  • What kind of ML task is this?
  • What kind of performance measure should we use?

Where we left off on Wednesday, January 7

First, some stuff about assessments


2. Get the data

For this class, we’ll use readily available datasets. Some sources are:

After fetching the data, set aside a test set and don’t look at it.

“Get the data” can often be a huge task in itself!


2a. Set aside a test set

❓ Discussion questions:

  • Why do we need an independent test set?
  • Why would we use a random seed?
  • What is naive about simply selecting a random sample?
  • What else could we do?
  • What is stratified sampling?

Side tangent: Sampling bias

  • Simple example: assume 80% of population likes cilantro
  • Goal: ensure our sample is representative of the population,

The binomial distribution can be used to model the probability of choosing $k$ people who like cilantro from $n$ total participants:


Side tangent: Sampling bias continued

$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$ is the probability mass function, and the corresponding cumulative distribution function is just the sum up to $k$:

$P(X \le k) = \sum_{i=0}^{k} \binom{n}{i} p^i (1-p)^{n-i}$

Suppose we randomly sample 100 people. What is the probability of fewer than 75 or more than 85 cilantro lovers?

This is also my excuse to review some probability theory and notation
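The cilantro question can be answered numerically. A minimal stdlib-only sketch (function and variable names are mine), summing the binomial pmf over both tails:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 100, 0.8  # sample 100 people; 80% of the population likes cilantro

# P(X < 75): sum the pmf over k = 0..74
p_low = sum(binom_pmf(k, n, p) for k in range(75))
# P(X > 85): sum the pmf over k = 86..100
p_high = sum(binom_pmf(k, n, p) for k in range(86, 101))

print(f"P(X < 75 or X > 85) = {p_low + p_high:.4f}")
```

With a mean of 80 and a standard deviation of 4, both tails sit a bit beyond one standard deviation, so the combined probability is well under a half but far from negligible.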


3. Explore the data

❓ Discussion questions:

  • What do you notice about the data?
  • Do the values make sense for the labels?
  • Is the scale of the features comparable? Does this matter?
  • What possible biases might be present in the data?

3a. Look for correlations

The Pearson correlation coefficient is a measure of the linear correlation between two variables $x$ and $y$ (commonly denoted as $r$):

$r = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$

where $\bar{x}$ and $\bar{y}$ are the sample means of $x$ and $y$, respectively.

  • What do correlations of 0, 1, and -1 mean?
  • What are some limitations of Pearson correlation?
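The limitations are easy to see by computing $r$ directly. A small stdlib-only sketch (function name and data are mine):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r: covariance of x and y over the product of their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2 * xi + 1 for xi in x]))  # perfectly linear: r = 1
print(pearson_r(x, [-3 * xi for xi in x]))     # perfectly anti-linear: r = -1
print(pearson_r(x, [xi ** 2 for xi in x]))     # nonlinear but monotonic: |r| < 1
```

The last case illustrates a key limitation: a strong nonlinear relationship still yields $|r| < 1$, and a symmetric one (e.g. $y = x^2$ centred on zero) can yield $r \approx 0$.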

Where we left off on Monday, January 12


4. Prepare the data

General goals:

  • Handle missing data, and maybe outliers
  • Drop irrelevant features
  • Combine features using domain knowledge
  • Apply various transformations (e.g. scaling, encoding)
  • Apply scaling when necessary

4a. Handling missing data

The book lists three options for handling the NaN values:

housing.dropna(subset=["total_bedrooms"], inplace=True)  # option 1: drop the affected rows
housing.drop("total_bedrooms", axis=1)                   # option 2: drop the whole column
median = housing["total_bedrooms"].median()              # option 3: impute the median
housing["total_bedrooms"].fillna(median, inplace=True)

❓ Discussion questions:

  • What is each option doing?
  • What are the pros and cons of each option?
  • Which one should we choose?

4b. Handling non-numeric data

Most of the math in ML algorithms is based on numbers, so we need to convert text and categorical attributes to numbers. This is called encoding.

❓ Discussion questions:

  • Which columns of our data are categorical?
  • What methods could we use to convert them to numbers?
  • What are the assumptions about the various encoding methods?

4c. Scaling the data

Many ML algorithms don’t like features with vastly different scales. Common scaling methods are min-max scaling and standardization.

Important: scaling is computed on the training set and applied to the validation and test sets - they are not scaled independently!

❓ Discussion questions:

  • What are the bounds of each method?
  • Which method is more affected by outliers?
  • How would you decide which method to use?
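The two methods can be compared side by side. A stdlib-only sketch (function names and data are mine, not from scikit-learn), which also shows why the training-set statistics must be reused:

```python
def minmax_scale(train, new):
    """Min-max scaling to [0, 1], using the *training* min/max for new data."""
    lo, hi = min(train), max(train)
    return [(v - lo) / (hi - lo) for v in new]

def standardize(train, new):
    """Zero mean, unit variance, using the *training* mean/std for new data."""
    n = len(train)
    mean = sum(train) / n
    std = (sum((v - mean) ** 2 for v in train) / n) ** 0.5
    return [(v - mean) / std for v in new]

train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]  # note: 10 falls outside the training range
print(minmax_scale(train, test))  # scaled values can exceed [0, 1] for unseen data
print(standardize(train, test))
```

Note how the test value of 10 maps outside $[0, 1]$ under min-max scaling: the bounds only hold for data seen during training.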

4e. Standardization details

A general Gaussian distribution is given by:

$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$

where $\mu$ is the mean and $\sigma$ is the standard deviation. The standard normal distribution is a special case where $\mu = 0$ and $\sigma = 1$, reducing the equation to:

$f(x) = \dfrac{1}{\sqrt{2\pi}} e^{-x^2/2}$


4f. Other transformations

  • Log transformation: useful for data that is heavily skewed
  • Also square root, squaring, etc.: try to remove heavy tails
  • Feature engineering: combining features to create new ones
  • Binning: turning continuous data into discrete categories
    • Possibly using K-means clustering
    • Relies on domain knowledge
  • Best to create a transformation pipeline and apply it to the data rather than saving the transformed data

Coming up next

  • Math review:
    • Linear algebra
    • Differential calculus
    • Statistics
  • A brief introduction to vector calculus

Lecture 2: Math Review

HTML Slides | PDF Slides

Math review

COMP 4630 | Winter 2026 Charlotte Curtis


Math review

  • MATH 1203: Linear algebra
  • MATH 1200: Differential calculus
  • MATH 2234: Statistics

Further reading:


Linear algebra

Vectors are multidimensional quantities (unlike scalars):

A common vector space is $\mathbb{R}^2$, or the 2D Euclidean plane. Example:


Vector operations

  • Addition:
  • Scalar multiplication:
  • Dot product: (yields a scalar)
    • Can be thought of as the projection of one vector onto another, or how much two vectors are aligned in the same direction

Vector norms

  • The norm of a vector is a measure of its length
  • Most common is the Euclidean norm (or $\ell_2$ norm): $\|\mathbf{v}\|_2 = \sqrt{\sum_i v_i^2}$
  • You might also see the $\ell_1$ norm, particularly as a regularization term: $\|\mathbf{v}\|_1 = \sum_i |v_i|$
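The dot product and both norms are one-liners in plain Python. A quick sketch (helper names are mine):

```python
from math import sqrt

def dot(u, v):
    """Dot product: sum of elementwise products (yields a scalar)."""
    return sum(ui * vi for ui, vi in zip(u, v))

def l2_norm(v):
    """Euclidean (L2) norm: square root of the sum of squares."""
    return sqrt(dot(v, v))

def l1_norm(v):
    """L1 norm: sum of absolute values."""
    return sum(abs(vi) for vi in v)

v = [3, 4]
print(l2_norm(v))           # 5.0 (the 3-4-5 triangle)
print(l1_norm(v))           # 7
print(dot([1, 0], [0, 1]))  # 0: orthogonal vectors are not aligned at all
```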

Useful vectors

  • Unit vector: A vector with a norm of 1, e.g. ,
  • Normalized vector: A vector divided by its norm, e.g.
  • Dot product can also be written as

Yes, a normalized vector is also a unit vector, main difference is in context and notation


Matrices

A matrix is a 2D array of numbers:

Notation: Element $a_{ij}$ is in row $i$, column $j$, also written as $[A]_{ij}$.

Rows then columns! An $m \times n$ matrix has $m$ rows and $n$ columns


Matrix operations

  • Addition: element-wise if dimensions match.
  • Scalar multiplication: just like vectors
  • Matrix multiplication: $C = AB$, where the elements of $C$ are: $c_{ij} = \sum_k a_{ik} b_{kj}$
    • Multiply and sum rows of $A$ with columns of $B$
    • Usually, $AB \neq BA$

Matrix multiplication examples

Matrix times a matrix:

Matrix times a vector:


Where we left off on January 14


Matrix transpose

  • Transpose: swaps rows and columns

  • Inverse: just as $x x^{-1} = 1$, $A A^{-1} = I$, where $I$ is the identity matrix

Not every matrix is invertible!


Calculus: Notation

The derivative of a function $f(x)$ is represented as:

$f'(x)$ (Lagrange notation) or $\dfrac{df}{dx}$ (Leibniz notation)

The second derivative is denoted:

$f''(x)$ or $\dfrac{d^2 f}{dx^2}$

and so on.


Differentiability


For a function $f$ to be differentiable at a point $a$, it must be:

  • Defined at $a$
  • Continuous at $a$
  • Smooth at $a$
  • Non-vertical at $a$

Select rules of differentiation

| Function | Lagrange | Leibniz |
|---|---|---|
| Constant: $f(x) = c$ | $f'(x) = 0$ | $\frac{d}{dx}c = 0$ |
| Power: $f(x) = x^n$ with $n \neq 0$ | $f'(x) = nx^{n-1}$ | $\frac{d}{dx}x^n = nx^{n-1}$ |
| Sum | $(f + g)'(x) = f'(x) + g'(x)$ | $\frac{d}{dx}(f + g) = \frac{df}{dx} + \frac{dg}{dx}$ |
| Exponential: $f(x) = e^x$ | $f'(x) = e^x$ | $\frac{d}{dx}e^x = e^x$ |
| Chain Rule | $(f \circ g)'(x) = f'(g(x))\,g'(x)$ | $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$ |
This is the kind of thing I would not expect you to memorize on an exam

Chain rule example

  1. Find for

  2. Now, let, , where . What is ?


Partial derivatives

For a scalar valued function $f(x, y)$, there are two partial derivatives: $\dfrac{\partial f}{\partial x}$ and $\dfrac{\partial f}{\partial y}$

These are computed by holding the “other” variable(s) constant. For example, if $f(x, y) = x^2 y$, then:

$\dfrac{\partial f}{\partial x} = 2xy, \qquad \dfrac{\partial f}{\partial y} = x^2$


A brief introduction to vector calculus

Putting together partial derivatives with vectors and matrices we get:

Scalar-valued :

Vector-valued :

Most of the time we’ll just be working with the gradient


Statistics: Notation

  • A random variable is a variable that can take on random values according to some probability distribution
  • It may take on discrete (e.g. dice rolls) or continuous (e.g. age) values
  • $X$ or $Y$ for the random variable and $x$ or $y$ for a specific value
  • $P(X = x)$ for a discrete distribution and $p(x)$ for continuous
  • $\mathbb{E}[X]$ for expectation and $\text{Var}[X]$ for variance

Some textbooks/papers/websites use different notation!


Discrete random variables

  • A discrete probability mass function describes the probability of $X$ taking on a specific value
  • Example: for a balanced 6-sided die, $P(X = x) = \frac{1}{6}$ for $x \in \{1, 2, \ldots, 6\}$
  • You can add together probabilities, e.g. $P(X \le 2) = P(X = 1) + P(X = 2) = \frac{1}{3}$
  • $P(X = x) \ge 0$ and $\sum_x P(X = x) = 1$ for any valid distribution

Continuous random variables

  • A continuous probability density function $p(x)$ gives the probability of $X$ being in some tiny interval $[x, x + \delta]$ given by $p(x)\,\delta$
  • Example: the uniform distribution, $p(x) = \frac{1}{b - a}$ for $a \le x \le b$
  • $P(X = x) = 0$ for any specific value
  • Need to integrate to get a concrete value, e.g. $P(a \le X \le b) = \int_a^b p(x)\,dx$
  • $p(x) \ge 0$ and $\int_{-\infty}^{\infty} p(x)\,dx = 1$ for any valid distribution

Expectation and variance

  • The expectation or expected value of a random variable is its average value
  • $\mathbb{E}[X] = \sum_x x\,P(X = x)$ for discrete and $\mathbb{E}[X] = \int x\,p(x)\,dx$ for continuous
  • More generally, for any function $f$: $\mathbb{E}[f(X)] = \sum_x f(x)\,P(X = x)$
  • The variance describes how much the values vary from their mean: $\text{Var}[X] = \mathbb{E}\left[(X - \mathbb{E}[X])^2\right]$

Multiple random variables

  • Joint probability $P(X = x, Y = y)$ is the probability of $x$ and $y$ occurring together
  • Conditional probability $P(Y = y \mid X = x)$ is the probability that $Y$ takes on value $y$ given that $X = x$ has already happened
  • In general, $P(X, Y) = P(Y \mid X)\,P(X)$
  • For independent variables, $P(X, Y) = P(X)\,P(Y)$
Note: I'm using uppercase P here, but it all applies to continuous distributions as well

Covariance

  • The covariance between $X$ and $Y$ gives a sense of how linearly related they are and how much they vary together: $\text{Cov}(X, Y) = \mathbb{E}\left[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\right]$
  • Related to correlation as $r = \dfrac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$
  • The covariance matrix of a random vector is a square matrix where element $(i, j)$ is the covariance between the $i$th and $j$th elements
  • The diagonal of the covariance matrix gives the variances $\text{Var}[X_i]$

The Normal distribution

Good “default choice” for two reasons:

  • The central limit theorem shows that the sum of many independent random variables is approximately normally distributed
  • Has the maximum entropy (most uncertainty) of any distribution with the same variance

We can’t integrate the Gaussian density analytically, so numerical approximations are used




Coming up next

  • Training (regression) models
    • Linear regression
    • Gradient descent
  • References and suggested reading:

Lecture 3: Training models

HTML Slides | PDF Slides

Training Models with Regression and Gradient Descent

COMP 4630 | Winter 2026 Charlotte Curtis


Overview

  • Linear Regression and the Normal Equation
  • Gradient Descent and its various flavours
  • References and suggested reading:

Linear Regression

Unlike most models, linear regression has a closed-form solution called the Normal Equation:

$\hat{\boldsymbol{\theta}} = (X^\top X)^{-1} X^\top \mathbf{y}$

where

  • $\hat{\boldsymbol{\theta}}$ are the weights of the model minimizing the cost function
  • $\mathbf{y}$ is the vector of target values
  • $X$ is the design matrix of feature values

As usual, different sources use different notation, e.g. $\mathbf{w}$ or $\boldsymbol{\beta}$ instead of $\boldsymbol{\theta}$.


Consider the 1-d case:


we want the values of $m$ and $b$ that minimize the Mean Square Error between the actual values $y_i$ and predicted values $\hat{y}_i = m x_i + b$:

$\text{MSE} = \dfrac{1}{n} \sum_{i=1}^{n} \left(y_i - (m x_i + b)\right)^2$


Solving for $m$ and $b$

Math time!


Solving for $m$ and $b$


After some algebraic gymnastics, we get:

$m = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b = \bar{y} - m\bar{x}$

where $\bar{x}$ and $\bar{y}$ are the means of the $x$ and $y$ values, respectively.


Expanding to matrix form

Instead of the scalar $x$ or even vector $\mathbf{x}$, it’s common to use a design matrix $X$ to represent the feature values:

where each row is an instance (sample) and each column is a feature.

The first column is all ones, representing the bias term


Back to the linear regression problem…

  • We can rewrite the estimate in matrix notation:

    $\hat{\mathbf{y}} = X\boldsymbol{\theta}$

  • The MSE can be written as:

    $\text{MSE} = \dfrac{1}{n} (X\boldsymbol{\theta} - \mathbf{y})^\top (X\boldsymbol{\theta} - \mathbf{y})$

    where we’ve used the trick of substituting $x_0 = 1$ so the bias term folds into $\boldsymbol{\theta}$

:abacus: Find the gradient of the MSE w.r.t. $\boldsymbol{\theta}$, set it to zero, and solve for $\boldsymbol{\theta}$


Properties of matrices and their transpose

The following properties are useful for solving linear algebra problems:

Additionally, any matrix or vector multiplied by the identity matrix $I$ is unchanged.


The Normal Equation

We made it! The Normal Equation is again:

$\hat{\boldsymbol{\theta}} = (X^\top X)^{-1} X^\top \mathbf{y}$

  • No optimization is required to find the optimal $\boldsymbol{\theta}$
  • Limitations:
    • $X^\top X$ must be invertible and small enough to fit in memory
    • The computational complexity is (at least) $O(n^{2.4})$ in the number of features
  • Even in linear regression problems, it is common to use gradient descent instead due to these limitations
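As a sanity check, the Normal Equation can be worked out by hand for a single feature plus a bias column, where $X^\top X$ is only 2×2 and easy to invert. A stdlib-only sketch (function name and data are mine):

```python
def normal_equation_1d(x, y):
    """Fit y ~ theta0 + theta1 * x by solving (X^T X) theta = X^T y directly.
    With X = [[1, x_i]], the 2x2 system can be inverted in closed form."""
    n = len(x)
    # entries of X^T X and X^T y
    sx, sxx = sum(x), sum(xi * xi for xi in x)
    sy, sxy = sum(y), sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx  # must be nonzero, i.e. X^T X invertible
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (n * sxy - sx * sy) / det
    return theta0, theta1

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]         # exactly y = 1 + 2x
print(normal_equation_1d(x, y))  # recovers (1.0, 2.0)
```

The `det` check is the invertibility limitation in miniature: if all $x_i$ are equal, the determinant is zero and no unique solution exists.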

Gradient Descent

The goal of gradient descent is still to minimize the cost function, but it follows an iterative process:

  1. Start with a random $\boldsymbol{\theta}$

  2. Calculate the gradient $\nabla_{\boldsymbol{\theta}} \text{MSE}$ for the current $\boldsymbol{\theta}$

  3. Update as $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla_{\boldsymbol{\theta}} \text{MSE}$

  4. Repeat 2-3 until some stopping criterion is met

    where $\eta$ is the learning rate, or the size of step to take in the direction opposite the gradient.
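The steps above can be sketched in plain Python for the 1-D linear model (toy data and names are mine; real code would use NumPy or a framework):

```python
def gradient_descent(x, y, eta=0.05, epochs=2000):
    """Batch gradient descent for y ~ theta0 + theta1 * x, minimizing MSE."""
    n = len(x)
    theta0 = theta1 = 0.0  # "random" start; zero works fine for this convex problem
    for _ in range(epochs):
        # gradient of MSE = (1/n) * sum (pred - y)^2 w.r.t. each parameter
        err = [(theta0 + theta1 * xi) - yi for xi, yi in zip(x, y)]
        g0 = 2 / n * sum(err)
        g1 = 2 / n * sum(e * xi for e, xi in zip(err, x))
        theta0 -= eta * g0  # step opposite the gradient
        theta1 -= eta * g1
    return theta0, theta1

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]       # exactly y = 1 + 2x
print(gradient_descent(x, y))  # approaches (1.0, 2.0)
```

Because MSE is convex, this converges to the same solution the Normal Equation gives in one shot, provided $\eta$ is small enough to be stable.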


Stochastic Gradient Descent

  • Standard or batch gradient descent uses the entire training set to calculate the gradient at every step
  • Stochastic Gradient Descent uses a single random instance at each step:
    1. Start with a random $\boldsymbol{\theta}$
    2. Pick a random instance $\mathbf{x}_i$ (row in the design matrix)
    3. Calculate the gradient for the current $\boldsymbol{\theta}$ and $\mathbf{x}_i$
    4. Update as $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla_{\boldsymbol{\theta}} L_i$
    5. Repeat 2-4 until some stopping criterion is met

Mini-batch Gradient Descent


  • Mini-batch gradient descent uses a random subset of the training set
  • Less chaotic than stochastic, but faster than batch
  • Most common type of gradient descent used in practice

Gradient Descent Hyperparameters

  • The learning rate $\eta$ - size of step taken
  • No rule that it needs to be constant! A simple learning schedule is to decrease $\eta$ over time, e.g. $\eta_t = \frac{\eta_0}{1 + t/s}$, where $t$ is the current iteration and $\eta_0$ and $s$ are hyper-parameters
  • For mini-batch, the batch size is another hyper-parameter
  • The number of epochs, or times to process the entire training set

Stopping Criteria


  • The simplest stopping criterion is to set a maximum number of epochs
  • Early stopping is another option:
    • Evaluate on a validation set at regular intervals
    • Stop when the validation error starts to increase
  • The comparison between training and validation performance can also help prevent overfitting

Loss functions

  • The loss function is the function being minimized by gradient descent

  • MSE is convex and guaranteed to have a single global minimum, but many other loss functions have multiple local minima

  • The relative scale of the features can affect the convergence:


Figure from Scikit-Learn book

Higher-order Polynomials

  • Higher order polynomials can be solved with the Normal Equation as well:
  • Just include the higher order terms in
  • This is still a linear regression problem because the coefficients are linear!
  • Risk of overfitting the data
  • Easy way to regularize: drop one or more of the higher order terms

Regularization

  • If the model fits the training data too well, but doesn’t generalize to new data, it is overfitting
  • Regularization imposes additional constraints on the weights
  • Example: Ridge Regression adds a term to the loss function: where is the regularization parameter
  • The regularization term is only added during training, not evaluation

Note: the term cost function is often used instead of loss function


Logistic regression and beyond

Logistic regression is a binary classifier that uses the logistic function (aka sigmoid function) to map the output to a range of 0 to 1:

$\sigma(z) = \dfrac{1}{1 + e^{-z}}$

We can then minimize the log loss or cross-entropy loss function:

$L = -\dfrac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \right]$

where $\hat{p}_i$ is the probability that instance $i$ is positive.

More on cross-entropy loss when we talk about classification models

The gradient of the log loss ends up being:

$\nabla_{\boldsymbol{\theta}} L = \dfrac{1}{m} X^\top \left( \sigma(X\boldsymbol{\theta}) - \mathbf{y} \right)$

  • There is no (known) analytical solution this time, but we can still use gradient descent!
  • In this case it’s still convex, so we don’t have to worry about local minima
  • In general, for a loss function to work with gradient descent, it must be:
    • Continuous and
    • Differentiable
    • … at the locations where you evaluate it

Next up: Backpropagation!

Lecture 4: Backpropagation

HTML Slides | PDF Slides

Backpropagation

COMP 4630 | Winter 2026 Charlotte Curtis


Overview

  • A brief review of the history of neural networks
  • Neurons, perceptrons, and multilayer perceptrons
  • Backpropagation
  • References and suggested reading:

The rise and fall of neural networks

In between each era of excitement and advancement there was an “AI winter”


Model of a neuron

  • McCulloch and Pitts (1943)
  • Neuron as a logic gate with time delay
  • “Activates” when the sum of inputs exceeds a threshold
  • Non-invertible (forward propagation only)

Threshold Linear Units (TLUs)

  • Linear I/O instead of binary
  • Rosenblatt (1957) combined multiple TLUs in a single layer
  • Physical machine: the Mark I Perceptron, designed for image recognition
  • Criticized by Minsky and Papert (1969) for its inability to solve the XOR problem - first AI winter

A single threshold logic unit (TLU)

Image source: Scikit-learn book


Training a perceptron

  • Hebb’s rule: “neurons that fire together, wire together”

where $x$ = input, $y$ = output

  • Fed one instance at a time,

  • Guaranteed to converge if inputs are linearly separable

  • :abacus: Simple example: AND gate

A perceptron with two inputs and three outputs

Image source: Scikit-learn book


Multilayer perceptrons (MLPs)

  • If a perceptron can’t even solve XOR, how can it do higher order logic?

  • Consider that XOR can be rewritten as:

    A xor B = (A and !B) or (!A and B)
    
  • A perceptron can solve and and or and not… so what if the input to the or perceptron is the output of two and perceptrons?


A solution to XOR



Backpropagation

  • I just gave you the weights to solve XOR, but how do we actually find them?
  • Applying the perceptron learning rule no longer works, need to know how much to adjust each weight relative to the overall output error
  • Solution presented in 1986 by Rumelhart, Hinton, and Williams
  • Key insight: Good old chain rule! Plus some recursive efficiencies

Training MLPs with backpropagation

  1. Initialize the weights, through some random-ish strategy

  2. Perform a forward pass to compute the output of each neuron

  3. Compute the loss of the output layer (e.g. MSE)

  4. Calculate the gradient of the loss with respect to each weight

  5. Update the weights using gradient descent (minibatch, stochastic, etc)

  6. Repeat steps 2-5 until stopping criteria met

    Step 4 is the “backpropagation” part


Example: forward pass

  • With a linear activation function:
  • In summation notation for a single sample:
  • In this case,


and


Example: calculate error and gradient

  • We never picked a loss function! Let’s assume we’re using MSE
  • For a single sample: $L = \frac{1}{2}(\hat{y} - y)^2$, with the added $\frac{1}{2}$ for convenience
  • The goal is to update each weight by a small amount to minimize the loss
  • Fortunately, we know how to find a small change in a function with respect to one of the variables: the partial derivative!

Recursively applying the chain rule

  • Weights in the second layer (connecting hidden and output):

  • For the first layer (connecting inputs to hidden): where is the output of the hidden layer


Bias terms

  • The toy example did not include bias terms, but these are very important (as seen in the perceptron examples)
  • With a single layer we can add a column of 1s to , but with multiple layers we need to add bias at every layer
  • The forward pass becomes:
  • Or in summation form:

Gradient with respect to the bias terms

  • For layer 2 (the output layer):

  • For layer 1:

    where is the input to the hidden layer


Summary in matrix form

| Parameter | Gradient |
|---|---|
| Weights of layer 2 | $\frac{\partial L}{\partial W_2} = (\hat{y} - y)\,\mathbf{h}^\top$ |
| Bias of layer 2 | $\frac{\partial L}{\partial b_2} = \hat{y} - y$ |
| Weights of layer 1 | $\frac{\partial L}{\partial W_1} = \left(W_2^\top (\hat{y} - y)\right)\mathbf{x}^\top$ |
| Bias of layer 1 | $\frac{\partial L}{\partial b_1} = W_2^\top (\hat{y} - y)$ |

Computational considerations

  • Many of the terms computed in the forward pass are reused in the backward pass (such as the inputs to each layer)

  • Similarly, gradients computed in layer are reused in layer

  • Typically each intermediate value is stored, but modern networks are big

    | Model | Parameters |
    |---|---|
    | Our example | 6 |
    | AlexNet (2012) | 60 million |
    | GPT-3 (2020) | 175 billion |

Choices in neural network design


Activation functions

  • The simple example used a linear activation function (identity)

  • To include other activation functions, the forward pass becomes:

  • The gradient in the output layer becomes: $\frac{\partial L}{\partial W_2} = (\hat{y} - y)\,\sigma'(z_2)\,\mathbf{h}^\top$, where $z_2$ is the summation of the second layer before applying the activation function

  • Problem! That step function in the original perceptron is not differentiable


Activation functions

  • A common early choice was the sigmoid function:
  • A more computationally efficient choice common today is the “ReLU” (Rectified Linear Unit) function:
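Both functions are one-liners, which is part of why ReLU is so cheap. A quick stdlib sketch:

```python
from math import exp

def sigmoid(z):
    """Squashes any real input into (0, 1); the gradient vanishes for large |z|."""
    return 1.0 / (1.0 + exp(-z))

def relu(z):
    """Rectified Linear Unit: max(0, z). Gradient is 1 for z > 0 and 0 for z < 0."""
    return max(0.0, z)

print(sigmoid(0))         # 0.5
print(sigmoid(10))        # ~1: the flat region where sigmoid gradients vanish
print(relu(-3), relu(3))  # 0.0 3
```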

Activation functions in hidden layers

The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles. – Deep Learning Book, Section 6.3

  • Activation functions in hidden layers serve to introduce nonlinearity
  • Common for multiple hidden layers to use the same activation function
  • Sigmoid, ReLU, and tanh (hyperbolic tangent) are common choices
  • Also “leaky” ReLU, Parameterized ReLU, absolute value, etc
  • Can be considered a hyperparameter of the network

Loss functions

  • The choice of loss function is very important!
  • Depends on the task at hand, e.g.:
    • Regression: MSE, MAE, etc
    • Classification: Usually some kind of cross-entropy (log likelihood)
  • May or may not include regularization terms
  • Must be differentiable, just like the activation functions

Activation functions in the output layer

  • Activation functions in the output layer should be chosen based on the loss function (and thus the task)
    • Regression: linear
    • Binary classification: sigmoid
    • Multiclass classification: softmax (generalization of sigmoid)
  • Again, must be differentiable

A complete fully connected network



Next up: Classification loss functions and metrics

Lecture 5: Classification

HTML Slides | PDF Slides

Classification loss functions and metrics

COMP 4630 | Winter 2026 Charlotte Curtis


Overview

  • All the derivation thus far has been for mean squared error
  • Cross-entropy loss is more appropriate for classification problems
  • References and suggested reading:

Revisiting the expected value

The expected value of some function $f(x)$ when $x$ is distributed as $P(x)$ is given in discrete form as:

$\mathbb{E}_{x \sim P}[f(x)] = \sum_x P(x)\,f(x)$

where the sum is over all possible values of $x$.

In continuous form, this is an integral:

$\mathbb{E}_{x \sim p}[f(x)] = \int p(x)\,f(x)\,dx$

Binary case: Bernoulli distribution

  • If a random variable $x$ has a probability $p$ of being 1 and a probability $1 - p$ of being 0, then $x$ is distributed as a Bernoulli distribution: $P(x) = p^x (1 - p)^{1 - x}$
  • The expected value of $x$ is then: $\mathbb{E}[x] = 1 \cdot p + 0 \cdot (1 - p) = p$

Information theory

Originally developed for message communication, with the intuition that less likely events carry more information, defined for a single event as:

$I(x) = -\log P(x)$


Entropy


  • We can measure the expected information of a distribution as: $H(P) = \mathbb{E}[I(x)] = -\sum_x P(x) \log P(x)$
  • This is called the Shannon entropy
  • Measured in bits (base 2) or nats (base $e$)
  • :abacus: Find the entropy of a Bernoulli distribution
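The abacus exercise can be checked numerically. A small stdlib sketch (function name is mine), working in bits:

```python
from math import log2

def bernoulli_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli(p) random variable."""
    if p in (0.0, 1.0):
        return 0.0  # the outcome is certain: zero information
    return -(p * log2(p) + (1 - p) * log2(1 - p))

print(bernoulli_entropy(0.5))  # 1.0 bit: a fair coin is maximally uncertain
print(bernoulli_entropy(0.9))  # a biased coin is less uncertain
print(bernoulli_entropy(1.0))  # 0.0: no surprise at all
```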

Cross-entropy

  • The KL divergence is a measure of the extra information needed to encode a message from a true distribution $P$ using an approximate distribution $Q$: $D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$
  • The cross-entropy is a simplification that drops the term $\sum_x P(x) \log P(x)$: $H(P, Q) = -\sum_x P(x) \log Q(x)$
  • Minimizing the cross-entropy is equivalent to minimizing the KL divergence
  • If $P = Q$, then $D_{KL}(P \| Q) = 0$ and $H(P, Q) = H(P)$

Cross-entropy loss


For a true label $y \in \{0, 1\}$ and predicted probability $\hat{p}$, the cross-entropy loss is:

$L = -\left[ y \log \hat{p} + (1 - y) \log(1 - \hat{p}) \right]$

where $\hat{p}$ is the output of the final layer of a neural network (thresholded to obtain the prediction $\hat{y}$)

This is also called log loss or binary cross-entropy
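The behaviour of log loss is easiest to see numerically. A stdlib-only sketch (function name and the `eps` clipping guard are mine):

```python
from math import log

def binary_cross_entropy(y, p_hat, eps=1e-12):
    """Log loss for a true label y in {0, 1} and predicted probability p_hat.
    eps clips the prediction away from 0 and 1 to avoid log(0)."""
    p_hat = min(max(p_hat, eps), 1 - eps)
    return -(y * log(p_hat) + (1 - y) * log(1 - p_hat))

print(binary_cross_entropy(1, 0.9))  # small loss: confident and correct
print(binary_cross_entropy(1, 0.1))  # large loss: confident and wrong
print(binary_cross_entropy(0, 0.1))  # small loss again
```

Note the asymmetry in magnitude: a confidently wrong prediction is penalized far more heavily than a confidently right one is rewarded, which is exactly the pressure that drives the model toward calibrated probabilities.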


Terminology for evaluation

  • True positive: predicted positive, label was positive ($\hat{y} = 1$, $y = 1$) ✔️

  • True negative: predicted negative, label was negative ($\hat{y} = 0$, $y = 0$) ✔️

  • False positive: predicted positive, label was negative ($\hat{y} = 1$, $y = 0$) ❌ (type I)

  • False negative: predicted negative, label was positive ($\hat{y} = 0$, $y = 1$) ❌ (type II)

  • Accuracy is the fraction of correct predictions, given as:

    $\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$


Precision and recall

  • Precision: Out of all the positive predictions, how many were correct? $\text{Precision} = \dfrac{TP}{TP + FP}$

  • Recall: Out of all the positive labels, how many were correct? $\text{Recall} = \dfrac{TP}{TP + FN}$

  • Specificity: Out of all the negative labels, how many were correct? $\text{Specificity} = \dfrac{TN}{TN + FP}$


Confusion matrix

|  | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |

  • The axes might be reversed, but a good predictor will have strong diagonals
  • There’s also the F1 score, or harmonic mean of precision and recall: $F_1 = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
Check out the Wikipedia page for more ways of describing the same information
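All of these metrics fall out of the four confusion-matrix counts. A small sketch (function name and the counts are mine, purely hypothetical):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common evaluation metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of predicted positives, how many were right?
    recall = tp / (tp + fn)     # of actual positives, how many were found?
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# hypothetical counts: 80 TP, 10 FP, 20 FN, 90 TN
acc, prec, rec, f1 = classification_metrics(80, 10, 20, 90)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

A production version would also guard against zero denominators (e.g. a classifier that never predicts positive makes precision undefined).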

ROC Curves

  • The receiver operating characteristic curve is a plot of the true positive rate (recall or sensitivity) vs. false positive rate (1 - specificity) as the detection threshold changes

  • The diagonal is the same as random guessing

  • A perfect classifier would hug the top left corner

Fun fact: the name comes from WWII radar operators, where true positives were airplanes and false positives were noise


Which classifier is better?



Multiclass case

  • For $K$ classes, the output is a vector $\hat{\mathbf{p}}$ with $\sum_{k=1}^{K} \hat{p}_k = 1$
  • The cross-entropy loss is then: $L = -\sum_{k=1}^{K} y_k \log \hat{p}_k$
  • For a one-hot encoded vector $\mathbf{y}$, this simplifies to: $L = -\log \hat{p}_c$, where $c$ is the index of the true class

The softmax function

  • For binary classification, the sigmoid function is used to predict the probability of the positive class

  • For multiclass classification, the softmax function is used:

    $\hat{p}_k = \dfrac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$

    where $z_k$ is the output of neuron $k$ in the final layer before the activation function is applied

  • This means that $K$ neurons are needed in the final layer, one for each class
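Softmax is a few lines of Python; the max-subtraction below is the usual numerical-stability trick (it doesn't change the result, since softmax is invariant to adding a constant to every score):

```python
from math import exp

def softmax(z):
    """Turn K raw scores into K probabilities that sum to 1."""
    m = max(z)                          # subtract the max for numerical stability
    exps = [exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # largest score gets the largest probability
print(sum(probs))  # 1.0
```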


Next up: Convolution and NN frameworks

Lecture 6: Modern Neural Networks

HTML Slides | PDF Slides

Intro to modern neural networks

COMP 4630 | Winter 2026 Charlotte Curtis


Overview


Revisiting Backpropagation

  • For a network with layers, the gradients of the loss function with respect to the weights in the last layer are given by:

    assuming that the output is a function of layer ’s input .

  • At layer , the gradients are computed as:


  • At layer , this becomes:

  • And so on, until we reach the first layer.

  • We are recursively applying the chain rule and re-using the gradients computed at the previous layer

  • This is great for computational efficiency, but it can also lead to vanishing or exploding gradients


Vanishing and Exploding Gradients

  • Vanishing/exploding gradients are where the gradients become near zero or near infinity as they are propagated back through the network
  • Particularly problematic for recurrent neural networks, where the same weights are multiplied by themselves repeatedly
  • Also a problem for very deep networks, and part of the reason that deep learning was not popular until the 2010s
  • ❓ What changed?

Consider the variance

  • At the input layer, , and has some variance
  • Assume and are initialized to 0
  • :abacus: What is the variance of ?
  • What about after the activation function ?

Initialization strategies

  • In 2010, Glorot and Bengio proposed the Xavier initialization for a layer with $n_{in}$ inputs and $n_{out}$ outputs: $\sigma^2 = \dfrac{2}{n_{in} + n_{out}}$
  • Goal is to preserve the variance of the input and output in both directions
  • Similar to LeCun initialization, and apparently an overlooked feature of networks from the 1990s

Initialization for ReLU

  • Glorot initialization was derived under the assumption of linear activation functions (even though they knew this wasn’t the case)
  • In 2015, He et al. proposed the He initialization specifically for ReLU activations: $\sigma^2 = \dfrac{2}{n_{in}}$
  • The choice of normal vs uniform is apparently not very important
  • Default in PyTorch is Kaiming (He) uniform

Batch normalization

  • Also in 2015, Ioffe and Szegedy proposed batch normalization as a way to mitigate vanishing/exploding gradients
  • This is simply a normalization at each layer, shifting and scaling the inputs to have a mean of 0 and a variance of 1 (across the batch)
  • A moving average of the mean and variance is maintained during training, and used for normalization during inference
  • It also ends up acting as regularization, magic!
  • ❓ Why wouldn’t you want to use batch normalization?

RELU and its variants

  • In early works, the sigmoid or tanh functions were popular
  • Both have a small range of non-zero gradients
  • ReLU has a stable gradient for positive inputs, but can lead to the dying ReLU problem whereby certain neurons are “turned off”
  • ❓ How can we prevent dying ReLUs?

Note: this may not be a problem, and ReLU is cheap. Don’t optimize prematurely unless you’re seeing lots of “dead” neurons.


Number of neurons and layers

  • Number of neurons in the input layer is defined by number of features
  • Number of neurons in the output layer is defined by prediction task
  • In between is a design choice
  • Common early choice was a pyramid shape, but it turns out that a stack of layers with the same number of neurons works well too
  • Deeper networks can solve more complex problems with the same number of total parameters, but are also prone to vanishing/exploding gradients

Ultimately, yet another a hyperparameter to be tuned


Optimization algorithms: variations on gradient descent

  • Gradient descent takes small regular steps, constant or otherwise
  • Many variations exist! For example, momentum keeps track of the previously computed gradient and uses it to inform the new step: $\mathbf{m} \leftarrow \beta \mathbf{m} - \eta \nabla_{\boldsymbol{\theta}} L$, then $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \mathbf{m}$, where $\beta$ is a hyperparameter between 0 and 1
  • Adaptive moment estimation (Adam) is a popular choice that adds on an exponentially decaying average of the squared gradients

The Adam optimizer

  • Keeps track of first ($\mathbf{m}$) and second ($\mathbf{v}$) moments of the gradient, with two exponential decay terms $\beta_1$ and $\beta_2$
  • At each time step, the update is now: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\, \dfrac{\hat{\mathbf{m}}}{\sqrt{\hat{\mathbf{v}}} + \epsilon}$
  • $\beta_1$ and $\beta_2$ are typically 0.9 and 0.999, respectively
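The Adam update fits in a short loop. A stdlib-only sketch for a single scalar parameter (function name, toy loss, and step count are mine), including the bias-correction terms from the original paper:

```python
from math import sqrt

def adam_minimize(grad, theta, eta=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Adam on one parameter: exponentially decaying averages of the gradient (m)
    and squared gradient (v), with bias correction for the zero-initialized EMAs."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g      # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g * g  # second moment (mean of squared gradients)
        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= eta * m_hat / (sqrt(v_hat) + eps)
    return theta

# toy loss L(x) = (x - 3)^2 with gradient 2(x - 3); minimum at x = 3
print(adam_minimize(lambda x: 2 * (x - 3), theta=0.0))  # approaches 3
```

Notice how the effective step size is roughly $\eta$ regardless of the raw gradient magnitude, because the first moment is divided by the square root of the second.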

Regularization via dropout

  • Dropout is a regularization technique that randomly sets a fraction of the neurons to zero during training
  • During each training pass, a neuron has a probability of being dropped
  • Similar to training an ensemble of models then bagging
  • Helps to prevent overfitting, but can slow down training
  • Typical values: 0.5 for hidden layers, 0.2 for input layers

Choices for starting

From the Deep Learning Book, section 11.2:

  • “Depending on the complexity of your problem, you may even want to begin without using deep learning.”
  • Tabular data: fully connected, images: convolutional, sequences: recurrent*
  • ReLU or variants, with He initialization
  • SGD or Adam, add batch normalization if unstable
  • Use some kind of regularization, such as dropout
*This book was written before transformer models

Implementation time

Lecture 7: Convolutional Neural Networks

HTML Slides | PDF Slides

Convolution and Convolutional Neural Networks

COMP 4630 | Winter 2026 Charlotte Curtis


Overview

  • Convolutional neural networks (CNNs) are a type of neural network that is particularly well-suited to image data
  • Before we can understand CNNs, we need to understand convolution
  • References and suggested reading:

Convolution

  • Convolution is defined as: $(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$

  • Or in the discrete case: $(f * g)[n] = \sum_{m} f[m]\, g[n - m]$

  • Can be thought of as “flipping” one function and sliding it over the other, multiplying and summing at each point


Example

We’re in a hospital dealing with an outbreak. For the first 5 days we have 1 patient on Monday, 2 on Tuesday, etc:

Fortunately, we know how to treat them: 3 doses on day 1, then 2, then 1:

And after 3 days they’re cured.

How many doses do we need on each day?

Hospital example taken from here
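The hospital example is exactly a discrete convolution, so NumPy can compute the daily dose totals directly:

```python
import numpy as np

patients = np.array([1, 2, 3, 4, 5])  # new patients on days 0..4
doses = np.array([3, 2, 1])           # doses per patient on their treatment days 1..3

daily_doses = np.convolve(patients, doses)
print(daily_doses)  # [ 3  8 14 20 26 14  5]
```

Each output day sums the doses owed to every cohort of patients still in treatment, which is the "slide, multiply, sum" picture above.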

Convolution in 2D

  • Extending to 2D basically adds another summation/integration: $(f * g)[m, n] = \sum_i \sum_j f[i, j]\, g[m - i, n - j]$
  • This can also be extended to higher dimensions
  • Caution: a colour image is a 3D array, not a 2D array
  • For typical image processing applications, the colour channels are convolved independently such that the output is still a 3D array

Convolution kernels

  • Typically there is a small kernel that is convolved with the input
  • This is just the smaller of the two functions in the convolution
  • ❓ What happens at the edges of the input?



Some common kernels

  • Averaging: $\frac{1}{9}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$
  • Differentiation (e.g. a horizontal difference): $\begin{bmatrix} -1 & 0 & 1 \end{bmatrix}$
  • Sizes are commonly chosen to be 3x3, 5x5, 7x7, etc.
  • ❓ Why divide by 9?
  • ❓ Why odd sizes?
  • ❓ What effect do you think these kernels will have on an image?
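A naive "valid" 2-D convolution makes the kernel mechanics concrete. This is purely illustrative; real libraries use far faster implementations:

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Naive 2-D 'valid' convolution (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    flipped = kernel[::-1, ::-1]  # true convolution flips the kernel
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * flipped)
    return out

avg = np.ones((3, 3)) / 9  # the 3x3 averaging kernel from above
img = np.arange(25, dtype=float).reshape(5, 5)
out = conv2d_valid(img, avg)
print(out.shape)  # (3, 3): a 5x5 input shrinks by kernel_size - 1 in each dimension
```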

A side tangent on frequency representation

  • Any signal can be represented as a weighted summation of sinusoids
  • For a discrete signal $x[n]$, you can think of this as: $x[n] = \sum_k a_k \cos(2\pi k n / N) + b_k \sin(2\pi k n / N)$
  • Or, using Euler’s formula $e^{j\theta} = \cos\theta + j\sin\theta$: $x[n] = \frac{1}{N}\sum_{k=0}^{N-1} X[k]\, e^{j 2\pi k n / N}$, where the complex coefficients $X[k]$ encode the amplitude and phase of each frequency
j is the same as i, but in engineering

Fourier Transform

  • To figure out what the coefficients are, we can use the Discrete Fourier Transform (DFT): $X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}$, where each element of $X$ is the coefficient for frequency $k$
  • The Fast Fourier Transform (FFT) computes the DFT in $O(N \log N)$ time
  • Convolution computed directly is $O(N^2)$

Convolution is multiplication in frequency

  • Sometimes it is useful to use the Fourier Transform to obtain a frequency representation of an image (or signal)
  • In the frequency domain, convolution is simply element-wise multiplication
  • This allows for some efficient operations such as blurring/sharpening an image, as well as some fancy stuff like deconvolution
  • Sharp edges in space become ringing in frequency, and vice versa
  • ❓ why do CNNs operate in the spatial domain?
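The "convolution is multiplication in frequency" claim can be checked numerically: zero-pad both signals to the full output length, multiply their FFTs, and invert. The signals here are arbitrary examples:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([1.0, 0.0, -1.0])

direct = np.convolve(x, k)          # direct linear convolution

n = len(x) + len(k) - 1             # length of the full linear convolution
via_fft = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)

assert np.allclose(direct, via_fft)  # same result, two routes
```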

Convolutional neural networks


  • 1958: Hubel and Wiesel experiment on cats and reveal the structure of the visual cortex
  • Determine that specific neurons react to specific features and receptive fields
  • Modelled in the “neocognitron” by Kunihiko Fukushima in 1980
  • LeCun’s work in the 1990s led to modern CNNs

Why CNNs?

  • A fully connected network has a 1:1 mapping of weights to inputs
  • Fine for MNIST (28x28) pixels, but quickly grows out of control
  • ❓ If you train a (fully connected) network on 100x100 images, how would you infer on 200x200 images?
  • ❓ What if an object is shifted, rotated, or flipped within the image?
Bluey image used without permission from bluey.tv

Convolutional layers

  • A convolution layer is a set of kernels whose weights are learned
  • Instead of the straight up weighted sum of inputs, the input image is convolved with the learned kernel(s)
  • The output is often referred to as a feature map
  • The dimensionality of the feature map is determined by the:
    • Size of the input image
    • Number of kernels
    • Padding (usually “same” or “valid”)
    • “Stride”, or shift of the kernel at each step

Dimensionality examples

| Input | Kernel | Stride | Padding | Output |
|-------|--------|--------|---------|--------|
|       |        | 1      | same    |        |
|       |        | 2      | same    |        |
|       |        | 1      | valid   | ???    |
  • The number of channels has no impact on the depth of the output: the number of kernels determines the depth of the output
  • The colour channels are convolved independently, then summed

Number of parameters example

  • Input:
  • Kernel:
  • Bias terms: 32
  • Total parameters:

While convolution only happens in 2D, the kernel can be thought of as a 3D volume - there’s a separate trainable kernel for each channel
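The parameter count follows from the kernel volume: one $k \times k$ slice per input channel, per output feature map, plus one bias per feature map. A sketch with illustrative sizes (3 input channels, 32 kernels of 3×3; these numbers are not necessarily the slide's example):

```python
in_channels, n_kernels, k = 3, 32, 3        # illustrative sizes

weights = k * k * in_channels * n_kernels   # each kernel is a k x k x in_channels volume
biases = n_kernels                          # one bias per output feature map
params = weights + biases
print(params)  # 3*3*3*32 + 32 = 896
```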


Pooling layers

Pooling layers are used to reduce the dimensionality of the feature maps (aka downsampling) by taking the maximum or average of a region


  • ❓ Why would we want to downsample?

Putting it all together


Backpropagating CNNs

  • After backpropagating through the fully connected head, we need to backpropagate through the:
    • Pooling layer
    • Convolution layer
  • The pooling layer has no learnable parameters, but it needs to remember which input was maximum (all the rest get a gradient of 0)
  • The convolution layer is trickier math, but it ends up needing another convolution - this time with the kernel transposed

LeNet (1998) vs AlexNet (2012)



Inception v1 aka GoogLeNet (2014)



  • Struggled with vanishing gradients
  • v2 introduced batch normalization

VGG (2014)


  • Simple architecture, but “very” deep (16 or 19 layers)
  • Fixed the convolution hyperparameters and focused on depth

ResNet (2015)


  • Key innovation: easier to learn “identity” functions ()
  • If a layer outputs 0, it doesn’t kill the gradient
  • Even deeper, e.g. ResNet-152

Transfer Learning

“If I have seen further, it is by standing on the shoulders of giants” – Isaac Newton

  • Transfer learning copy pastes a trained network into a new task
  • You can select which layers to keep, which to freeze, and which to re-train
  • You can also drop new layers on top of the old ones
  • Most of the time you want to freeze the early layers and add a new “head”

Data Augmentation

  • Garbage in, garbage out

  • We can artificially increase diversity with data augmentation:

    • Random crops, flips, rotations
    • Rescaling/resizing
    • Changing colours
  • AutoAugment does a bunch of this automatically


Next up: RNNs

Lecture 8: Recurrent Neural Networks

HTML Slides | PDF Slides

Recurrent Neural Networks

COMP 4630 | Winter 2026 Charlotte Curtis


Overview


Sequence data

  • So far we’ve been talking about images, tabular data, and other “static” data
  • ❓ What are some examples of sequence data?



Non-RNN Approaches


As usual, you don’t always need a deep learning solution :hammer:

  • ❓ What is an example of a “naive” approach?
  • ❓ What are some limitations of naive approaches?

Autoregressive Moving Average

  • Models to predict time series as a weighted sum of past values and past errors: $\hat{y}_t = \sum_{i=1}^{p} \alpha_i y_{t-i} + \sum_{j=1}^{q} \theta_j \epsilon_{t-j}$, where $\epsilon_t$ is the prediction error at step $t$
  • Key assumption: data is stationary (mean and variance don’t change)
  • ARIMA adds on “integration” or “differencing” to account for trends

ARIMA

  • Autoregressive parameter : How many steps back to average?
  • Moving average parameter : How many previous errors to average?
  • Integrative parameter : How many “differencing” rounds to perform before applying ARMA?

    Differencing $d$ times can be thought of as approximating a $d$th-order polynomial trend


  • ❓ Are there any obvious trends in the data?
  • ❓ What about non-obvious trends?
  • ❓ How might this dataset be treated differently from the previous one?


Feedforward vs recurrent networks

  • Feedforward: data flows in one direction (then backpropagated)
  • Recurrent: data can flow in loops
Figure from Scikit-learn textbook

Recurrent layers

  • The simplest recurrent layer has a single feedback connection: $h_t = \phi(W_x x_t + W_h h_{t-1} + b)$, where $\phi$ is the activation function and $W_x$ and $W_h$ are weight matrices
  • “Backpropagation through time” (BPTT) is exactly the same as regular backpropagation through the unrolled network
  • ❓ What kind of issues might arise during training?
  • ❓ What are some limitations of this approach?
  • ❓ How can we deal with $h_{t-1}$ for $t = 0$?
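The recurrence can be unrolled in a few lines of NumPy; the layer sizes, random weights, and inputs below are arbitrary, and the initial hidden state is simply zeros:

```python
import numpy as np

rng = np.random.default_rng(42)

n_features, n_hidden, seq_len = 3, 4, 5
Wx = rng.normal(size=(n_hidden, n_features)) * 0.1  # input-to-hidden weights
Wh = rng.normal(size=(n_hidden, n_hidden)) * 0.1    # hidden-to-hidden weights
b = np.zeros(n_hidden)

h = np.zeros(n_hidden)                  # h_{-1}: start from zeros
xs = rng.normal(size=(seq_len, n_features))
for x in xs:
    h = np.tanh(Wx @ x + Wh @ h + b)    # h_t = tanh(Wx x_t + Wh h_{t-1} + b)

print(h.shape)  # (4,): one hidden state vector, reused at every time step
```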

Preparing data for RNNs

  • The data format depends on the task, e.g. do you want to predict:
    • The next value in a sequence (e.g. predictive text)
    • The next values in a sequence (e.g. stock prices)
    • The next sequence in a set of sequences (e.g. language translation)
  • Let’s start with predicting the next value in a sequence



Activation Functions for RNNs

  • The default activation function in tensorflow/PyTorch is tanh
  • ❓ What is different about RNNs that might influence the choice of activation function?
  • ❓ How might we normalize sequence data?

Beyond the “next value”

  • Option 1: Use the single-prediction RNN repeatedly
  • Option 2: Train the RNN to predict multiple values at once
    • Easy change model-wise, but data preparation is trickier
    • n inputs, n outputs
  • Option 3: Use a “sequence to sequence” model
    • Even trickier data preparation, but n inputs are predicted at each time step instead of just at the end

Seq2seq input/target examples

| Option | Input | Target |
|--------|-------|--------|
| 1 | [0, 1, 2] | [1, 2, 3] |
| 2 | [0, 1, 2] | [[1, 2], [2, 3], [3, 4]] |
| 3 | [0, 1, 2] | [[1, 2, 3], [2, 3, 4], [3, 4, 5]] |

Problems with long sequences

  • Gradient vanishing/exploding
    • Choose activation functions and initialization carefully
    • Consider “Layer normalization” (across features)
  • “Forgetting” early data
    • Skip connections through time
    • “Leaky” RNNs
    • Long short-term memory (LSTM)
  • Computational efficiency and memory constraints
    • Gated recurrent units (GRUs)

Skip connections and leaky RNNs

  • Simple way of preserving earlier data:
  • Vanilla RNN: depends on only
  • Skip connection: depends on , , , etc.
  • Leaky RNN has a smooth “self-connection” to dampen the exponential: $h_t = \alpha h_{t-1} + (1 - \alpha)\, \phi(W_x x_t + W_h h_{t-1} + b)$, with $0 < \alpha < 1$
  • Not common approaches anymore, as LSTM, GRU, and especially attention mechanisms are more popular

Long Short-Term Memory (LSTM)


Figure from Scikit-learn textbook

Gated Recurrent Units (GRUs)


Figure from Scikit-learn textbook

Next up: Natural Language Processing


Preview: Natural Language Processing

  • Natural Language Processing (NLP) is a field of study that focuses on the interaction between computers and human language.
  • RNNs are widely used in NLP tasks such as language modeling, machine translation, sentiment analysis, and text generation.
  • Language modeling involves predicting the next word in a sequence of words, which can be done using RNNs.
  • Machine translation uses RNNs to translate text from one language to another.
  • Sentiment analysis aims to determine the sentiment or emotion expressed in a piece of text, and RNNs can be used for this task.
  • Text generation involves generating new text based on a given input, and RNNs are commonly used for this purpose.
This slide written by GitHub Copilot in Winter 2024

Preview: Natural Language Processing

  • What is Natural Language Processing (NLP)?
  • Common NLP tasks:
    • Language modeling
    • Machine translation
    • Sentiment analysis
    • Text generation
  • How RNNs are applied in NLP
This slide written by GitHub Copilot in Winter 2025

Preview: Natural Language Processing

  • NLP in 2026 is dominated by large language models (LLMs) like GPT-4o, Claude, and Gemini
  • Transformer-based architectures have largely replaced RNNs for most NLP tasks
  • Key capabilities of modern NLP systems:
    • Multi-modal understanding (text, images, audio, video)
    • Long-context reasoning (millions of tokens)
    • Agentic behaviour: tool use, planning, and self-correction
  • ❓ If transformers have replaced RNNs, why are we still studying them?
This slide written by GitHub Copilot in Winter 2026

Lecture 9: Natural Language Processing

HTML Slides | PDF Slides

Natural Language Processing

COMP 4630 | Winter 2026 Charlotte Curtis


Overview


Tokenization

  • Consider the sentence:

    “The cat sat on the mat.”

  • This can be split up into individual words or tokens:

    [“The”, “cat”, “sat”, “on”, “the”, “mat”, “.”]

  • ❓ what other ways could we tokenize this sentence?

  • ❓ what about punctuation, capitalization, etc.?


RNN + tokens: predict the next character

  • Just like predicting the next day’s weather or stock price, we can predict the next character in a sentence using an RNN
  • Input tokens: ['T', 'h', 'e', ' ', 'c', 'a', 't', ' ', 's', 'a', 't', ' ', 'o', 'n', ' ', 't', 'h', 'e', ' ', 'm', 'a', 't', '.']
  • Numeric representation: [20, 8, 5, 0, 3, 1, 20, 0, 19, 1, 20, 0, 15, 14, 0, 20, 8, 5, 0, 13, 1, 20, 2]
  • We could train an RNN model to predict the next character
  • For more info check out Andrej Karpathy’s blog post, one of the sources for the Scikit-learn chapter

Repeatedly predicting the next character

  • To predict whole sentences from a starting point, we can predict the next character and append it to the input, then predict again
  • In practice this tends to get stuck in loops: Input: "to be or not" Output: "to be or not to be or not to be or not..."
  • ❓ how might we avoid this?
  • ❓ could we just predict the next whole word instead?

Controlled chaos: softmax temperature

  • The softmax is defined as: $\sigma(z)_i = \frac{e^{z_i}}{\sum_{k=1}^{K} e^{z_k}}$, where $z$ is a vector of logits, or log probabilities
  • This estimates the probability of class $i$ out of $K$ classes
  • Adding a temperature parameter $T$: $\sigma(z, T)_i = \frac{e^{z_i / T}}{\sum_{k=1}^{K} e^{z_k / T}}$

Temperature Example

  • Vocab = ["to", "be", "or"]
  • Assume that the “logits”
  • Sample the next word from the resulting distribution
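A sketch of temperature-scaled softmax over this vocabulary; since the slide's logit values aren't shown, the numbers here are made up for illustration:

```python
import numpy as np

def softmax_temperature(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

vocab = ["to", "be", "or"]
logits = [2.0, 1.0, 0.5]       # illustrative values, not from the slides

for T in (0.5, 1.0, 2.0):
    # low T sharpens the distribution, high T flattens it toward uniform
    print(T, softmax_temperature(logits, T).round(3))
```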



In the beginning, there were $n$-grams

  • A simple way to represent text is as a bag of $n$-grams

  • unigram: single words (aka “Bag of Words”):

    [“the”, “cat”, “sat”, “on”, “the”, “mat”]

  • bigram: pairs of words:

    [“the cat”, “cat sat”, “sat on”, “on the”, “the mat”]

  • trigram: triples of words:

    [“the cat sat”, “cat sat on”, “sat on the”, “on the mat”]


Predictive text with $n$-grams

  • Given a sequence of tokens, we can predict the probability of the $n$th token given the previous tokens: $P(w_n \mid w_{n-1}, \dots, w_1)$

  • Each of these conditional probabilities can be estimated from the frequency of the $n$-grams in a corpus

  • The most likely next word is the one with the highest probability

  • ❓ What are some limitations of this approach?
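A toy bigram predictor makes the frequency-counting idea concrete: count which word follows each word in a (tiny, made-up) corpus, then predict the most frequent follower:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat sat on the rug".split()

# Count bigram frequencies: P(next | word) ~ count(word, next) / count(word)
following = defaultdict(Counter)
for w, nxt in zip(corpus, corpus[1:]):
    following[w][nxt] += 1

def predict_next(word):
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat": it follows "the" twice, "mat"/"rug" once each
```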


$n$-gram challenges

  • $n$-grams lose the meaning of words:
    • “The cat sat on the mat”
    • “The dog sat on the rug”
  • Also subject to the curse of dimensionality
    • A vocabulary of size $V$ leads to $V^n$ possible $n$-grams
    • Most $n$-grams will not be present in the corpus!
  • ❓ Can you think of an $n$-gram modification that could help this problem?

Side note: The curse of dimensionality

  • Data is often represented as $m$ samples with $n$ features each
  • As $n$ increases, the number of samples required to cover the space increases exponentially
Figure 5.9 from the Deep Learning Book

Word embeddings

  • Alternative solution: represent individual words as vectors, or embeddings

  • ❓ How are these embeddings defined?

| Word | Embedding |
|------|-----------|
| cat | [0.2, 0.3, 0.5] |
| dog | [0.1, 0.4, 0.4] |
| mat | [0.5, 0.2, 0.2] |
| rug | [0.4, 0.1, 0.1] |
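Cosine similarity between these toy embeddings illustrates the key idea: related words end up closer together in the vector space:

```python
import numpy as np

# The toy embeddings from the table above
emb = {
    "cat": np.array([0.2, 0.3, 0.5]),
    "dog": np.array([0.1, 0.4, 0.4]),
    "mat": np.array([0.5, 0.2, 0.2]),
    "rug": np.array([0.4, 0.1, 0.1]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(emb["cat"], emb["dog"]))  # higher: both are animals
print(cosine(emb["cat"], emb["mat"]))  # lower: unrelated concepts
```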


Learning word embeddings

  • Wednesday we’ll discuss the influential Word2Vec paper, but it wasn’t the first time embeddings were learned as part of a network
  • The concept was first presented successfully by Bengio in 2001

So we have embeddings, now what?

  • We can use these embeddings as input to a neural network
  • Applications:
    • Sentiment analysis: is a review/tweet/comment positive or negative?
    • Named entity recognition: who/what is mentioned in a text?
    • Machine translation: convert text from one language to another
    • Predictive text: what word comes next?
    • Text generation: create new text based on a given input

Sentiment analysis

General process:

  • Standardize and tokenize the text
  • Add an embedding layer (trainable or pre-trained)
  • Add a recurrent layer, such as a GRU
  • Add a dense layer with sigmoid activation

To Colab!

This is the process you’ll be following for Assignment 3


Sequence to Sequence models


Back to RNNs

  • RNNs predict the future based on the past

  • This is exactly what we want for predicting stock prices, weather, etc

  • ❓ What about translating a sentence from one language to another?

    Time flies like an arrow; fruit flies like a banana.

  • ❓ Can you think of a way to get RNNs to see the future?


Bidirectional RNNs

  • Simple approach: just reverse the sequence


Figure 10.11 from the Deep Learning Book

Pretraining

  • Embeddings like Word2Vec have been trained on large corpora
  • Surely this provides a great starting point for our models!
    • ❓ what are some potential drawbacks?
  • ELMo was introduced in 2018 specifically to address the limitations of Word2Vec and GloVe (another popular embedding)

“Our representations differ from traditional word type embeddings in that each token is assigned a representation that is a function of the entire input sentence. We use vectors derived from a bidirectional LSTM that is trained with a coupled language model objective on a large text corpus” – Peters et al


Subword Tokenization

  • Word embeddings are great, but still have limitations
  • ELMo uses character tokenization to handle out-of-vocabulary words
  • In between characters and words are subwords
    • "This warm weather is enjoyable"
    • "This", "warm", "weath", "er", "is", "enjoy", "able"
  • Byte Pair Encoding is the most common subword tokenization method, used by GPT; BERT uses the closely related WordPiece
  • ❓ What are some advantages of subword tokenization?

Machine Translation

| English | Spanish |
|---------|---------|
| My mother did nothing but weep | Mi madre no hizo nada sino llorar |
| Croatia is in the southeastern part of Europe | Croacia está en el sudeste de Europa |
| I would prefer an honorable death | Preferiría una muerte honorable |
| I have never eaten a mango before | Nunca he comido un mango |
  • ❓ What kind of challenges can you think of?

Encoder-Decoder Models

  • RNNs can convert an arbitrary length sequence into a fixed length vector
  • RNNs can convert a fixed length vector into an arbitrary length sequence
  • Why not use two RNNs to convert a sequence to a sequence?
  • The output head is a softmax layer with one node for each word in the target vocabulary



Teacher Forcing

  • This model uses teacher forcing to train the decoder
  • It feels like cheating, but this involves feeding the correct output to the decoder at each time step
  • This speeds up training and can improve performance
  • Avoids the whole backpropagation through time thing and makes training of RNNs parallelizable
  • ❓ What are the implications at inference time?

Coming up next: Attention mechanisms and Transformers

Lecture 10: Transformers

HTML Slides | PDF Slides

Transformers and Large Language Models

COMP 4630 | Winter 2026 Charlotte Curtis


Overview

  • Attention mechanisms
  • Transformers and large language models
    • Multi-head attention, positional encoding, and magic
    • BERT, GPT, Llama, etc.
  • References and suggested readings:

Attention Overview


  • Basically a weighted average of the encoder states, called the context
  • Weights usually come from a softmax layer
  • ❓ what does the use of softmax tell us about the weights?

Encoder-Decoder with Attention


  • The alignment model is used to calculate attention weights
  • The context vector changes at each time step!

Attention Example


  • Attention weights can be visualized as a heatmap
  • ❓ What can you infer from this heatmap?

The math version

The context vector at each step is computed from the alignment weights as:

$$c_t = \sum_i \alpha_{t,i} h_i$$

where $h_i$ is the encoder output at step $i$ and $\alpha_{t,i}$ is computed as:

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}$$

where $e_{t,i}$ is the alignment score or energy between the decoder hidden state at step $t$ and the encoder output at step $i$.


Different kinds of attention

  • The original Bahdanau attention (2014) model: $e_{t,i} = v^\top \tanh(W s_{t-1} + U h_i)$, where $v$, $W$, and $U$ are learned parameters
  • Luong attention (2015) uses a dot product, where the encoder outputs are multiplied by the decoder hidden state at the current step: $e_{t,i} = s_t^\top h_i$
  • ❓ What might be a benefit of dot-product attention?

Attention is all you need


  • Some Googlers wanted to ditch RNNs
  • If they can eliminate the sequential nature of the model, they can parallelize training on their massive GPU (TPU) clusters
  • Problem: context matters!
  • ❓ How can we preserve order, but also parallelize training?

Transformers


  • Still encoder-decoder and sequence-to-sequence
  • $N$ stacked encoder/decoder layers
  • New stuff:
    • Multi-head attention
    • Positional encoding
    • Skip (residual) connections and layer normalization

Multi-head attention


  • Each input is linearly projected $h$ times with different learned projections
  • The projections are aligned with independent attention mechanisms
  • Outputs are concatenated and linearly projected back to the original dimension
  • Concept: each layer can learn different relationships between tokens

What are V, K, and Q?

  • Attention is described as querying a set of key-value pairs
  • Kind of like a fuzzy dictionary lookup
  • $K$ is an $n_k \times d_k$ matrix of keys, where $d_k$ is the dimension of the keys
  • $V$ is an $n_k \times d_v$ matrix of values
  • $Q$ is an $n_q \times d_k$ matrix of queries
  • The product $Q K^\top$ is $n_q \times n_k$, representing the alignment score between queries and keys (dot product attention)
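Scaled dot-product attention, $\text{softmax}(Q K^\top / \sqrt{d_k})\, V$, is short enough to sketch in NumPy; the matrix sizes here are arbitrary:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) alignment scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V, weights                      # weighted average of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries, d_k = 4
K = rng.normal(size=(3, 4))   # 3 keys
V = rng.normal(size=(3, 5))   # 3 values, d_v = 5

out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)  # (2, 5) (2, 3)
```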

The various (multi) attention heads

  • Encoder self-attention (query, key, value are all from the same sequence)
    • Learns relationships between input tokens (e.g. English words)
  • Decoder masked self-attention:
    • Like the encoder, but only looks at previously generated tokens
    • Prevents the decoder from “cheating” by looking at future tokens
  • Decoder cross-attention:
    • Decodes the output sequence by “attending” to the input sequence
    • Queries come from the previous decoder layer, keys and values come from the encoder (like prior work)

Positional encoding

  • By getting rid of RNNs, we’re down to a bag of words
  • Positional encoding re-introduces the concept of order
  • Simple approach for word position $p$ and dimension $i$: $PE_{(p,\,2i)} = \sin\!\left(p / 10000^{2i/d}\right)$, $PE_{(p,\,2i+1)} = \cos\!\left(p / 10000^{2i/d}\right)$
  • The resulting vector is added to the input embeddings
  • ❓ Why sinusoids? What other approaches might work?
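The sinusoidal encoding can be sketched directly from the formula, using the interleaved sin/cos layout (the sequence length and model dimension below are arbitrary):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]        # word position p
    i = np.arange(d_model // 2)[None, :]     # dimension pair index i
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions get sin
    pe[:, 1::2] = np.cos(angles)             # odd dimensions get cos
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)  # (50, 16): one d_model-dimensional vector per position
```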

Other positional encodings

  • Appendix D of the new version of Hands-on Machine Learning goes into detail about positional encodings, particularly for very long sequences
  • Sinusoidal encoding is no longer used in favour of relative approaches
    • Introduces locality bias
    • Removes early-token bias
  • Example: a learnable bias for each relative position $j - i$, clamped to a maximum distance $k$
  • Only $2k + 1$ bias terms to learn, regardless of sequence length

Interpretability


  • The arXiv version of the paper has some fun visualizations
  • This is Figure 5, showing learned attention from two different heads of the encoder self-attention layer

A smattering of Large language models

  • GPT (2018): Train to predict the next word on a lot of text, then fine-tune
    • Used only masked self-attention layers (the decoder part)
  • BERT (2018): Bidirectional Encoder Representations from Transformers
    • Train to predict missing words in a sentence, then fine-tune
    • Used unmasked self-attention layers only (the encoder part)
  • GPT-2 (2019): Bigger, better, and capable even without fine-tuning
  • Llama (2023): Accessible and open source language model
  • DeepSeek (2024): Significantly more efficient, but accused of being a distillation of OpenAI’s models

Hugging Face

  • Hugging Face provides a whole ecosystem for working with transformers

  • Easiest way is with the pipeline interface for inference:

    from transformers import pipeline
    nlp = pipeline("sentiment-analysis") # for example
    nlp("Cats are fickle creatures")
    
  • Hugging Face models can also be fine-tuned on your own data


And now for something completely different: Deep reinforcement learning

Lecture 11: Reinforcement Learning

HTML Slides | PDF Slides

(Deep) Reinforcement Learning

COMP 4630 | Winter 2026 Charlotte Curtis


Overview

  • Terminology and fundamentals
  • Q-learning
  • Deep Q Networks
  • References and suggested reading:

Reinforcement Learning + LLMs



Terminology

  • Agent: the learner or decision maker
  • Environment: the world the agent interacts with
  • State: the current situation
  • Reward: feedback from the environment
  • Action: what the agent can do
  • Policy: the strategy the agent uses to make decisions

Classic example: Cartpole


The Credit Assignment Problem

  • Problem: If we’ve taken 100 actions and received a reward, which ones were “good” actions contributing to the reward?
  • Solution: Evaluate an action based on the sum of all future rewards
    • Apply a discount factor $\gamma$ to future rewards, reducing their influence
    • Common choice in the range of roughly 0.9 to 0.99
    • Example of actions/rewards (with $\gamma = 0.8$, the return of the first action works out to $10 + 0.8 \times 0 + 0.8^2 \times (-50) = -22$):
      • Action: Right, Reward: 10
      • Action: Right, Reward: 0
      • Action: Right, Reward: -50

Policy Gradient Approach

  • If we can calculate the gradient of the expected reward with respect to the policy parameters, we can use gradient descent to find the best policy
  • Loss function example: REINFORCE (Williams, 1992): $\nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_t R_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$, where:
    • $\pi_\theta(a_t \mid s_t)$ = output probability for action $a_t$ given state $s_t$
    • $R_t$ = (discounted) reward at timestep $t$
    • $\theta$ = model parameters

Value-based methods

  • Policy gradient approach is direct, but only really works for simple policies
  • Value-based methods instead explore the space and learn the value associated with a given action
  • Based on Markov Decision Processes (review from AI?)

Markov Chains

  • A Markov Chain is a model of random states where the future state depends only on the current state (a memoryless process)
  • Used to model real-world processes, e.g. Google’s PageRank algorithm
  • ❓ Which of these is the terminal state?
Figure 18-7 from the Scikit-learn book

Markov Decision Processes


  • Like a Markov Chain, but with actions and rewards
  • Bellman optimality equation: $V^*(s) = \max_a \sum_{s'} T(s, a, s')\left[R(s, a, s') + \gamma V^*(s')\right]$

Iterative solution to Bellman’s equation

Value Iteration:

  1. Initialize $V(s) = 0$ for all states $s$
  2. Update $V(s)$ using the Bellman equation
  3. Repeat until convergence

Problem: we still don’t know the optimal policy
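Value iteration on a toy MDP; the states, transitions, rewards, and $\gamma$ here are all made up for illustration (transitions are deterministic to keep the update simple):

```python
# Tiny deterministic MDP (illustrative, not from the slides):
# state 0: action "a" -> state 1, reward 1; action "b" -> state 1, reward 2
# state 1: action "a" -> state 1, reward 0 (absorbing)
transitions = {0: {"a": (1, 1.0), "b": (1, 2.0)}, 1: {"a": (1, 0.0)}}
gamma = 0.5

V = {0: 0.0, 1: 0.0}
for _ in range(100):  # repeat the Bellman update until (well past) convergence
    V = {s: max(r + gamma * V[s2] for (s2, r) in acts.values())
         for s, acts in transitions.items()}

print(V)  # V[1] = 0 (absorbing, no reward), so V[0] = max(1, 2) + 0.5 * 0 = 2.0
```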


Q-Values

Bellman’s equation for Q-values (optimal state-action pairs):

$$Q^*(s, a) = \sum_{s'} T(s, a, s')\left[R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\right]$$

Optimal policy $\pi^*$:

$$\pi^*(s) = \operatorname{argmax}_a Q^*(s, a)$$

For small spaces, we can use dynamic programming to iteratively solve for $Q^*$


Q-Learning

  • Q-Learning is a variation on Q-value iteration that learns the transition probabilities and rewards from experience
  • An agent interacts with the environment and keeps track of the estimated Q-values for each state-action pair
  • It’s also a type of temporal difference learning (TD learning), which is kind of similar to stochastic gradient descent
  • Interestingly, Q-learning is “off-policy” because it learns the optimal policy while following a different one (in this case, totally random exploration)

Q-Learning Update rule

  • At each iteration, the Q estimate is updated according to:

    $$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left(r + \gamma \max_{a'} Q(s', a')\right)$$

  • Where:

    • $Q(s, a)$ is the estimated value of taking action $a$ in state $s$
    • $\alpha$ is the learning rate (decreasing over time)
    • $r$ is the immediate reward
    • $\gamma$ is the discount factor
    • $\max_{a'} Q(s', a')$ is the maximum Q-value for the next state
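One application of the update rule, sketched with a small Q table (all values illustrative):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One temporal-difference update of the Q table."""
    target = r + gamma * Q[s_next].max()          # r + gamma * max_a' Q(s', a')
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q

Q = np.zeros((3, 2))   # 3 states, 2 actions
Q[1] = [0.0, 1.0]      # suppose we already value state 1 a little

Q = q_update(Q, s=0, a=0, r=5.0, s_next=1)
print(Q[0, 0])  # 0.9 * 0 + 0.1 * (5 + 0.95 * 1) ≈ 0.595
```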

Exploration policies

  • ❓ How do you balance short-term rewards, long-term rewards, and exploration?
  • Our small example used a purely random policy
  • $\epsilon$-greedy chooses to explore randomly with probability $\epsilon$, and greedily with probability $1 - \epsilon$
  • Common to start with a high $\epsilon$ and gradually reduce it (e.g. 1 down to 0.05)

Challenges with Q-Learning

  • ❓ We just converged on a 3-state problem in 10k iterations. How many states are in something like an Atari game?
  • ❓ How do we handle continuous state spaces?

One approach: Approximate Q-learning:

  • A parameterized function $Q_\theta(s, a)$ approximates the Q-value for any state-action pair
  • The number of parameters can be kept manageable
  • Turns out that neural networks are great for this!

Deep Q-Networks

  • We know states, actions, and observed rewards
  • We need to estimate the Q-values for each state-action pair
  • Target Q-values: $y = r + \gamma \max_{a'} Q_\theta(s', a')$
    • $r$ is the observed reward, $s'$ is the next state
    • $\gamma \max_{a'} Q_\theta(s', a')$ is the network’s estimate of the future reward
  • Loss function: $\mathcal{L} = \left(y - Q_\theta(s, a)\right)^2$
  • Standard MSE, backpropagation, etc.

Challenges with DQNs

  • Catastrophic forgetting: just when it seems to converge, the network forgets what it learned about old states and comes crashing down
  • The learning environment keeps changing, which isn’t great for gradient descent
  • The loss value isn’t a good indicator of performance, particularly since we’re estimating both the target and the Q-values
  • Ultimately, reinforcement learning is inherently unstable!

The last topic: Generative AI and ethics



Generated using MidJourney in Winter 2024

GenAI + Ethics Discussion

  • Generative images have gotten really good
  • What can we do? What should we do?

Lecture 12: Ethics Discussion

HTML Slides | PDF Slides

Generative AI, ethics, and policies

COMP 4630 | Winter 2026 Charlotte Curtis


Guiding questions for discussion

  • What are some ways that AI can be helpful? Harmful?
  • What are some ways that AI is used?
    • Axes: intent (malicious <-> innocent), impact (harmful <-> helpful)
  • What concerns do you have about AI in general?
    • What about in education, specifically at MRU?
  • What would you like to see in an AI policy at MRU?

Case study: vibe coding


Costs* of:

*Carbon footprint, $$, scalability, security, etc



April fools?


Impact


Coming up next

  • Easter break, no class on Monday
  • Wednesday: One last journal club presentation (YOLO) and a smattering of additional topics (autoencoders, object detection, image generation?)
  • Monday: last day of class! Assignment 3 results, project checkpoint discussions. Maybe some kind of treats.

Tutorial 1: Data exploration and wrangling

Before building a machine learning model, it is important to understand and wrangle your data into an appropriate numeric format. In this tutorial, we’ll look at how I like to set up my projects and some tips for exploratory visualizations.

Part 1: Project configuration

Tutorials are not marked in this course, so it’s up to you to keep track of them separately. I recommend copying this directory to a new location rather than forking the entire w26 repo and working out of that, otherwise you’ll have a bunch of merge conflicts and extra stuff if you want to submit a PR to the main repo. Alternatively, you can create a fork and work within a separate branch.

Tools:

Part 2: Exploratory visualizations

Visualizations for the purposes of exploring data (rather than communicating results) can be “quick and dirty”, but there are some guidelines to consider, as well as a few tricks that can help.

Follow along with the notebook and answer the various TODOs.

Part 3: Reverse engineer a cleaned dataset

  1. Create a new .ipynb file to explore this new dataset
  2. Read the raw data into a pandas DataFrame. You can either download the zip file, or install the ucimlrepo package and fetch the data directly.
  3. Read the pre-processed version into a different pandas DataFrame.
  4. Try to answer the following questions:
    1. How were the categorical features handled?
    2. Were any of the numerical categories manipulated?
    3. What additional transformations might be useful for this dataset?

Tutorial 2: Linear algebra and NumPy exercises

Linear algebra is foundational to machine learning, and NumPy is a mature Python library that allows you to work efficiently with matrices and vectors.

For this tutorial, I’ll be providing a printed worksheet with some exercises to do by hand. You can then check your answers using NumPy. This serves to both refresh your memory on linear algebra as well as gain some familiarity working with NumPy.

NumPy Basics

By convention, NumPy is imported with the name np:

import numpy as np

You can then create an N-dimensional array by passing a standard Python list to np.array:

v = np.array([1, 2]) # a 1-D vector
print(v.shape) # prints (2,)
A = np.array([[1, 2], [3, 4]]) # a 2-D matrix
print(A.shape) # prints (2, 2)

The default multiplication operator is element-wise. If you want matrix multiplication, use the @ operator, or the dot function for vectors:

print(A * v)
print(A @ v)
print(v.dot(v)) # or np.dot(v, v)

Output:

[[1 4]
 [3 8]]

[ 5 11]

5

Transposing a matrix is quite simple, but the inverse needs the linalg submodule:

print(A.T) # Transpose
print(np.linalg.inv(A)) # Matrix inverse

Output:

[[1 3]
 [2 4]]

[[-2.   1. ]
 [ 1.5 -0.5]]

Similarly, linalg has useful functions like norm, det, solve… getting familiar with the docs can be handy!
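For example, solve computes the solution of a linear system directly, which is generally preferable to forming the inverse; a quick sketch using the same A as above:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
b = np.array([5, 11])

x = np.linalg.solve(A, b)   # solves Ax = b without explicitly inverting A
print(x)                    # [1. 2.]
print(np.linalg.norm(x))    # Euclidean (L2) norm of x
print(np.linalg.det(A))     # determinant of A
```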

More resources

Gradient Descent for Polynomial Regression

Note

Solution now available! You can view it rendered on GitHub here.

There’s some fake data in the file data.csv, with a single feature x and a true value y. Your task is to:

  1. Load the data and look at it
  2. Split it into training, validation, and test sets
  3. Create your design matrix
  4. Implement gradient descent to find the best fit polynomial
  5. Evaluate your model’s performance and experiment with different hyperparameters

It’s up to you to decide what degree polynomial to fit the data, and you can also play around with stochastic gradient descent, mini-batch, hyperparameters, etc.

Important

Do this without the use of scikit learn or other libraries aside from numpy and matplotlib!

Step 0: Import libraries and seed your random number generator

It’s usually a good idea to start with a consistent random number seed to ensure reproducibility.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1234)  # seed with an integer of your choice

Step 1: Load the data and look at it

x, y = np.loadtxt("data.csv", delimiter=",", skiprows=1, unpack=True)
#TODO: visualize

Step 2: Split the data

Weird NumPy quirk: by default, a 1-D array has a shape of (n,), but to behave as a proper column vector in matrix operations, we need it to be (n, 1). An easy way to do this is to pass np.newaxis as the second index when sampling your y data, e.g.:

n = len(y)
train_ids = rng.choice(n, size=int(0.8 * n), replace=False)  # e.g. an 80% training split
x_train, y_train = x[train_ids], y[train_ids, np.newaxis]

Don’t worry about the x values for now, as we’ll be matrixifying them shortly anyway.

Step 3: Create your design matrix $\mathbf{X}$

For the example given in class, the design matrix was simply a column of 1s concatenated with the feature vector, i.e.:

$$\mathbf{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$$

For this exercise, you probably want to fit a higher degree polynomial, so the design matrix will be something like:

$$\mathbf{X} = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^d \\ 1 & x_2 & x_2^2 & \cdots & x_2^d \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^d \end{bmatrix}$$

where $d$ is the degree of the polynomial you want to fit. Try multiple degrees and see what gives the best results.

A note on scaling: the range of x values in this example is fairly small, but if you choose a high degree polynomial you will still end up with fairly different scales for your “features”. Consider normalizing each column of the design matrix (other than the first column accounting for the bias term), remembering to calculate your scaling parameters on the training data and apply them to the validation/test data.

Since you’ll be doing this twice (train/test), you might want to define a function to create the design matrix given a vector x and a degree d.
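Such a function could be as simple as a wrapper around NumPy's Vandermonde matrix builder (a sketch; the function name is just a suggestion):

```python
import numpy as np

def design_matrix(x, degree):
    # Columns are x^0 (the bias column of 1s), x^1, ..., x^degree
    return np.vander(x, degree + 1, increasing=True)

X = design_matrix(np.array([1.0, 2.0, 3.0]), degree=2)
print(X.shape)   # (3, 3)
```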

Step 4: Implement gradient descent

This has a number of sub-components. First you’ll need to define your gradient function. For mean squared error, the gradient can be calculated as:

$$\nabla_\theta \text{MSE} = \frac{2}{n} \mathbf{X}^T (\mathbf{X}\theta - \mathbf{y})$$

where $\mathbf{X}$ is your design matrix, $\theta$ is the current parameter vector, and $\mathbf{y}$ is the true target value.

It’ll also be useful to define the actual mean squared error to evaluate your model:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$
Now you can define your hyperparameters and run your gradient descent. For batch gradient descent, you’ll need to define:

  • learning rate $\eta$ (usually somewhere in the range of $10^{-3}$ to $10^{-1}$)
  • stopping criterion (can just be a fixed number of iterations)

The general algorithm for gradient descent is:

  1. Start with a random $\theta$
  2. Calculate the gradient $\nabla_\theta \text{MSE}$ for the current $\theta$
  3. Update $\theta$ as $\theta \leftarrow \theta - \eta \nabla_\theta \text{MSE}$
  4. Repeat 2-3 until some stopping criterion is met

You could also try mini-batch or stochastic gradient descent by adding an outer epoch loop if you want to get fancy.
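The algorithm above can be sketched as follows (a minimal batch version; variable names are illustrative, and the fixed iteration count stands in for the stopping criterion):

```python
import numpy as np

def mse_gradient(X, theta, y):
    # Gradient of mean squared error: (2/n) * X^T (X theta - y)
    return 2 / len(y) * X.T @ (X @ theta - y)

def batch_gradient_descent(X, y, eta=0.1, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal((X.shape[1], 1))         # start with a random theta
    for _ in range(n_iters):                             # fixed-iteration stopping criterion
        theta = theta - eta * mse_gradient(X, theta, y)  # gradient step
    return theta
```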

Step 5: Evaluate your model’s performance and experiment

Now that you’ve computed a final estimate of $\theta$, apply it to your test set to see how well your model performs, perhaps by plotting the data as well as the best fit curve. If it doesn’t look good, try changing various hyperparameters, like $\eta$, the number of iterations, and the degree of the polynomial. If you didn’t rescale your design matrix earlier, try it now!

Technically we should have done a 3-way train/validate/test split, but I kept it as just train/test to keep things manageable.

Backpropagation with a toy MLP

Before we move on to a full-featured toolbox, I wanted to provide you with something a bit simpler. I’ve written some (questionable) code in mlp_regressor.py to try to implement a multi-layer perceptron. I’ve also provided a starter notebook after throwing you to the wolves last week - you can download your copy here, or from GitHub.

Step 1: Load and preprocess data

We’ll use a well-known and fairly clean dataset to try to predict wine quality. You’ll still need to encode (or ignore) the one categorical feature color, then split and normalize the inputs. You’ll also need to pip install ucimlrepo to get the data-fetching module.

Step 2: Build and train an MLP

The MLPRegressor class should be able to train a small multi-layer perceptron. You can use it like this:

from mlp_regressor import MLPRegressor
mlp = MLPRegressor(X_train.shape[1])
mlp.add_layer(<number of neurons>, "activation function")
... repeat
print(mlp) # to see a summary of layers

loss = mlp.train(X_train, y_train, step_size, epochs)
plt.plot(loss)

It’s very inefficient, so don’t go too crazy with the number of neurons. After training, you can predict by just running the forward pass:

y_pred = mlp.forward(X_train)

There’s also an example in the main block of mlp_regressor.py.

Step 3: Modify the MLP

Try to read through the forward and backward passes to understand how it works. It’s entirely possible I’ve made a mistake somewhere, so don’t hesitate to ask if something doesn’t make sense.

To understand things in more detail, it can be helpful to try to modify it. Right now, the MLP only does whole-batch gradient descent. Can you modify it so that it does mini-batch or stochastic gradient descent?
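One common structure is an outer epoch loop that shuffles the sample indices, with an inner loop over batches. A minimal index generator along those lines (a sketch only, not the actual MLPRegressor API):

```python
import numpy as np

def minibatch_indices(n_samples, batch_size, rng):
    # Yield index arrays covering the dataset once, in shuffled order (one epoch)
    ids = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield ids[start:start + batch_size]

# Usage sketch:
# rng = np.random.default_rng(0)
# for epoch in range(epochs):
#     for batch in minibatch_indices(len(y), 32, rng):
#         ...one gradient step on X[batch], y[batch]
```

Setting batch_size to 1 recovers stochastic gradient descent.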

Modern neural networks with PyTorch

In this lab, I’m going to be walking through the PyTorch intro code that we started in lecture before the midterm. This is a pretty small model so it should be feasible on lab computers or laptops, but you might want to get used to using Colab as well.

I’ll also introduce you to Assignment 2!

RNN Activity: Predicting the stock market

Note

I do not recommend making any kind of financial decisions based on RNNs.

Setup

To fetch the data, we’ll need the yfinance package. Activate your virtual environment, then run pip install yfinance. The starter notebook shows how to download some data.

Steps

  1. As usual, we need to split the data. For time series data, it is very important that you split chronologically, not randomly. The yfinance data has a DateTimeIndex, which means that we can index it with dates directly, e.g:
     train_end = '2023-12-31'
     train = data['Close'][:train_end].values
    
    This also assumes that we’re interested in just the daily closing price. I’d suggest at least a year’s worth of data for both test and validation, with the rest for training.
  2. Decide on a scaling factor and rescale your data to the 1-ish range. This should not be informed by the test or validation data!
  3. Follow the various TODO items to get the single-prediction RNN working.
  4. Modify both the TimeSeriesDataset and the SimpleRnnModel to predict multiple days in the future instead of just one.

    Tip

    You may need to squeeze and/or expand your data at some point in the process to ensure the data dimensions are correct.
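For reference, windowing a 1-D series into (past window, future horizon) pairs can be done like this (a sketch; the starter TimeSeriesDataset may differ in details):

```python
import numpy as np

def make_windows(series, window, horizon):
    # Each input is `window` consecutive values; each target is the next `horizon` values
    X, y = [], []
    for i in range(len(series) - window - horizon + 1):
        X.append(series[i:i + window])
        y.append(series[i + window:i + window + horizon])
    return np.array(X), np.array(y)
```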

Further Exercises and Questions

  1. What impact does the scaling have on the model? Can you trigger exploding/vanishing gradients?
  2. Can you improve the model? Try playing around with LSTM/GRU instead of RNN, number of hidden units, activation functions, etc
  3. What happens if you pass a longer time window to the model? You will need to create a new torch.FloatTensor from the numpy data to do this instead of using the TimeSeriesDataset class.

Tutorial 10: Applied Transformers

We don’t really have the computational resources to train our own transformer models, but we can play around with pre-trained models.

The HuggingFace Transformers API is an easy way to download a pre-trained model and either use as-is, or fine-tune for your application. I’ll provide a few suggestions, but the official tutorial has a lot more info.

Using pretrained transformer models

If you’re working on Kaggle or Colab, the transformers library may already be installed; otherwise, you’ll need to pip install transformers.

  1. Pick a task and a model from the model list - I recommend sticking to something on the small side for your own sanity. Text generation is fun to play with so I’m going to start there, e.g.:

    from transformers import pipeline
    
    pipe = pipeline("text-generation", model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
    

    Tip

    If you’re on a GPU-enabled platform like Kaggle, you can move the model to GPU as follows:

    from accelerate import Accelerator
    
    device = Accelerator().device
    
    pipe = pipeline("text-generation", model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", device=device)
    
  2. Test it out! The simplest way to interact with a text generation model is to pass it a string:

    pipe("What is the airspeed velocity of an unladen swallow?")
    
  3. Add context. Some of the models (like the Deepseek example) allow additional chat context, e.g.:

    messages = [
        {"role": "user", "content": "What is the airspeed velocity of an unladen swallow?"},
    ]
    pipe(messages)
    

    How does this affect the value that is returned?

  4. You can inspect and modify the configuration of the pipeline with the pipe.generation_config object. Try changing the temperature parameter and see how it impacts the results, e.g.:

    pipe.generation_config.temperature = 1.5
    

    I’d also recommend reducing the max_new_tokens so that it runs faster and doesn’t ramble quite so much!

  5. To make a real interactive chatbot, create a sentinel loop that:

    1. prompts for input
    2. passes the input to the pipeline
    3. prints out the response
    4. concatenates the response with the previous messages
    5. prompts for input again
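A single turn of such a loop might be factored out like this (a sketch; it assumes the pipeline accepts a message list and, for chat-style models, returns the running conversation in "generated_text"):

```python
def chat_turn(pipe, messages, user_input, max_new_tokens=100):
    # Append the user's message, generate a reply, and return the updated history
    messages = messages + [{"role": "user", "content": user_input}]
    out = pipe(messages, max_new_tokens=max_new_tokens)[0]["generated_text"]
    if isinstance(out, list):   # chat pipelines return the full message list
        return out
    return messages + [{"role": "assistant", "content": out}]

# Sentinel loop sketch: an empty line ends the chat
# messages = []
# while (prompt := input("You: ")):
#     messages = chat_turn(pipe, messages, prompt)
#     print(messages[-1]["content"])
```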

There’s lots more to play around with in this library, but that’s probably plenty for the tutorial time period.

Assignment 1: Data discovery and visualization

Due January 30, 2026 at 5 pm. Reasonable requests for extensions will be granted when requested at least 48 hours before the due date.

You may work in groups of 2 or 3. Click here to create your group on GitHub Classroom and clone the starting code.

Overview

Real world data is messy and incomplete in unexpected ways. Often, the information you need is in some kind of text field, or in a totally separate database that needs to be merged in. While I have done some basic filtering of the dataset, the focus of this assignment is on exploring and preparing the data for use with a machine learning model.

After exploring and cleaning your data, you will be training a regressor to predict a numeric value. You may choose any of the regression models from scikit-learn’s supervised learning modules, provided inference (prediction) is fairly fast (i.e. don’t use nearest neighbours). If you want to use something other than scikit-learn, just let me know - it’s probably fine! Again, the main focus of this assignment is the dataset exploration and processing part.

Choosing a model is a whole thing in itself, but don’t stress about it too much. Feel free to compare a couple, or consult this chart for guidance, but it doesn’t cover everything. Notably, decision trees and random forests are missing, despite being pretty common, easily understood, and high-performing models.

I have set aside a subset of the data to test your model. Model performance will be evaluated and compared using the Mean Absolute Error. Note that since these are not practice or training datasets, the results may not be very good! As long as you’re in a reasonable range (less than about 100% relative error), your grade will not be affected by the prediction performance.

YYC Housing Data

Download the csv from Google Drive (it’s too big for GitHub!)

Original Source

I have removed some redundant columns and excluded non-residential properties such as parking spots. I’ve also combined the 2023 and 2024 assessment values and excluded properties where the roll_number disappeared from one year to the next (e.g. in the case of a subdivision).

Unlike the California housing dataset example, this contains property details and assessed values for each individual house in Calgary. Your goal will be to try to predict the change in assessment value from 2023 to 2024 based on the other columns of the dataset. Be careful though - there are some weird city-specific codes in there. For example, the sub_property_use column contains the following codes:

| Code | Description |
|------|-------------|
| RE0100 | Residential Acreage |
| RE0110 | Detached |
| RE0111 | Detached with Backyard Suite |
| RE0120 | Duplex |
| RE0121 | Duplex Building |
| RE0201 | Low Rise Apartment Condo |
| RE0210 | Low Rise Rental Condo |
| RE0301 | High Rise Apartment Condo |
| RE0310 | High Rise Rental Condo |
| RE0401 | Townhouse |
| RE0410 | Townhouse Complex |
| RE0601 | Collective Residence |
| RE0800 | Manufactured Home |

Similarly, the land_use_designation refers to the city’s land use zones, which restrict the type of building that can be constructed on a given property. These zones are about to become much simpler, but for now the column exists in the dataset. It’s up to you to decide how (or if) to use it.

Feel free to get creative! Did odd-numbered houses increase more than even? Does distance from city centre make a difference? Apply your domain knowledge to select and transform your features.

Deliverables

Your assignment should consist of both a .ipynb notebook (committed with cells rendered) for your exploratory analysis and model training, as well as your “production” code providing a predict function.

Exploration and model training notebook

This notebook should follow the general process outlined in class to do the first four steps of the ML Project Checklist, plus training of a simple model. The emphasis is on the data exploration and preparation rather than the model itself. Model shortlisting and fine-tuning is not required (though it is allowed if you’d like).

Specifically, this notebook should include:

  • loading the data
  • setting aside a test set (as appropriate for the problem)
  • exploratory visualizations, with comments about your observations
  • your preprocessing pipeline
  • your model training, either with cross-validation or a set-aside validation dataset
  • saving your pipeline + model for production

Some guidelines are provided in the template notebook - feel free to modify as desired.

If you try something and then ultimately don’t use it, it’s fine to leave it in the notebook. I’d like to see things you thought of and then discarded.

Note

While most of your preprocessing decisions are up to you, please do not drop any samples. If you encounter missing values, you can drop the column or impute values, but the number of predicted values must match the number of input samples.

“Production” code

After deciding on a preprocessing pipeline, training a model, and saving it all to disk, implement the function predict in prod.py. This function should:

  • load the required libraries
  • load your model from disk
  • apply your preprocessing
  • return the predicted values

If you are using Scikit-learn’s Pipeline class to combine preprocessing with your regression model, then this function could be as simple as loading your pipeline and returning pipeline.predict(data).

predict should take as input a pandas DataFrame of the data with the re_assessed_value_2024 column removed, and return a NumPy array of predicted property assessment changes from 2023 to 2024. Make sure not to drop any samples! The length of the output must match the length of the input.

Important

Make sure that I can run your prod.py code! I will be running your code from a master script in a Python 3.12 environment with the packages defined in requirements.txt installed. I will also cd to your repo and import prod. Please ensure:

  • You are loading your model using a relative path
  • If you have additional Python dependencies, add them to requirements.txt
  • If you want to use a different language altogether, that is okay - just make sure that your dependencies are clearly documented and easy to install on Windows 11.

Written response

Answer the questions in README.md. Point form and short responses are fine! If you really hate Markdown, you can add a PDF instead.

Tips

  1. I recommend creating a virtual environment and installing the packages in requirements.txt. This will ensure that your code runs on my system:

    python -m venv venv
    venv\Scripts\activate      # Windows; on Mac/Linux: source venv/bin/activate
    pip install -r requirements.txt
    

    on a Mac, use python3 and pip3 instead.

    If you use any other packages and want to add them to the requirements list, you can update it with:

    pip freeze > requirements.txt
    

    (again with pip3 if you are a Mac user).

  2. The end-to-end ML project from the textbook (presented in a condensed form in class) provides examples of some data transformation and visualization techniques, but these do not cover all scenarios. You may need to do some additional research to find the right technique for your dataset - in this case, make sure to cite your sources with a comment in your code.

  3. I have reserved some data for a friendly competition between groups. You might want to test your predict function with your own subset of data to make sure the loading and processing behave correctly in an isolated environment.

  4. Make sure to remove the target column from the dataframe before processing! I will be calling predict with a dataframe that has re_assessed_value_2024 removed.

Marking Scheme

Each of the following components will be marked on a 4-point scale and weighted.

| Component | Weight |
|-----------|--------|
| Data exploration (visualizations, observations) | 30% |
| Preprocessing decisions | 30% |
| Model inference works and training approach is good | 20% |
| Written responses | 15% |

| Score | Description |
|-------|-------------|
| 4 | Excellent - thoughtful and creative without any errors or omissions |
| 3 | Pretty good, but with minor errors or omissions |
| 2 | Mostly complete, but with major errors or omissions, lacking in detail |
| 1 | A minimal effort was made, incomplete or incorrect |
| 0 | No effort was made, or the submission is plagiarized |

Assignment 2: Playing card classification

Due March 6, 2026 at 5 pm, with the usual weekend flexibility.

You may work in teams of 2 or 3. Click here to create your team on GitHub Classroom and clone the starter code. This can be the same team or different from assignment 1.

Overview

The purpose of this assignment is to apply your theoretical knowledge of neural networks (particularly convolutional neural networks) to a real application. You will build and train a model from “scratch” (using PyTorch or some other framework, not really from scratch), and then see how much you can reduce its size while minimizing performance degradation. In addition to building and tweaking a neural network, this assignment serves as an introduction to:

  • Working with a modern NN framework
  • Preprocessing for image data
  • Evaluating classification models

Dataset

This time, I’m choosing the dataset: Playing Cards. This is quite a clean dataset with reasonable class balance. There are lots of implementations of classifiers using this dataset, and if you do look at someone else’s work for ideas, make sure to cite your sources and understand what you are implementing.

Caution

This dataset is very consistent compared to playing cards found in the wild. I will be evaluating on my own hand-curated dataset - you may want to augment yours with some extra card images as well.

Deliverables

Your assignment should consist of the following:

  1. Your notebook(s) and/or Python scripts where you did your experiments, with the final training run and evaluation rendered
  2. A report describing your experiments and your final model decisions
  3. Your final model classes in a Python module named your_team_name.py, alongside saved weights. Please make sure that your models load and run properly in Colab. If additional packages are required, list them in your report document.

I would recommend working in parallel with your teammate(s) and commit your changes after each experiment. It’s fine if you have multiple working notebooks, just indicate to me which is the final version.

Your training code

Your training code (notebook or Python script) should:

  • Load the training data and do some basic data exploration, like looking at samples, number of classes, class distribution, etc. The code in starter.ipynb provides some ideas for connecting Colab to Google Drive, defining the training dataset, and inspecting a few samples.
  • Do any preprocessing you might want to do (at the very least, you’ll probably want to rescale the images from unsigned ints in the range 0-255 to floats in the range 0 to 1)

    Hint: Pytorch has some preprocessing layers that you can stick on the start of your model much like Scikit-learn’s pipelines. Check out Torchvision Transforms for more ideas.

  • Define and train a model, keeping in mind the following:
    • The input layer must match the number of channels of your input. You do not need to define the batch size.
    • The output layer must have as many neurons as classes you are trying to predict.
    • Everything in between is a design choice that you can tweak!
  • Iterate! I would suggest starting with a simple CNN feeding in to a fully connected output layer. I included my fairly random and not at all optimized model in the starter code.
  • Train two models: one “best performance” version, where you try to get the highest accuracy, and one “size optimized” where you try to maintain reasonably good accuracy with the fewest parameters.
  • Once you’ve trained your models, save the weights using torch.save(model.state_dict(), path) as described here. You may need to share these with me via Google Drive if they end up too big, but if not, you can just commit the binary weights file to your repo.
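To make the input/output constraints concrete, here's the rough shape such a model can take in PyTorch (layer sizes are arbitrary placeholders rather than a recommendation, and 53 classes assumes the standard 52 cards plus joker):

```python
import torch
from torch import nn

class SmallCardClassifier(nn.Module):
    # Sketch only: one small conv stack feeding a fully connected output layer
    def __init__(self, n_classes=53, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # makes the head independent of input resolution
        )
        self.head = nn.Linear(32, n_classes)  # one output neuron per class

    def forward(self, x):
        x = self.features(x)
        return self.head(x.flatten(1))
```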

Your production code

To load the weights for your model and run inference, PyTorch needs to know the class definition. There is a way of saving the whole thing at once, but it’s fragile (much like pickling a Scikit-learn pipeline with custom functions).

After you’re happy with both small and large models, copy the class definitions in a Python module named for your team name (e.g. super_awesome_team.py). Also copy over your transformation pipeline - I will be passing this to ImageFolder when I load the top secret evaluation set. You can either use the same one for small and large models, or two separate ones, or parameterize with image size, etc.

from super_awesome_team import SmallModel, BigModel, img_transform
model = SmallModel()
model.load_state_dict(torch.load(path_to_small_weights, weights_only=True))
model.eval() # disables dropout and batch norm for inference

Tip

Make sure to include any necessary preprocessing like rescaling in your transformation pipeline, but do not include image augmentation steps like RandomHorizontalFlip.

Your report

In a separate document, summarize your experiments, models, observations, reflections, etc. I’ve provided a template with more details in the starter code (report.md), though you aren’t limited to the markdown format.

Marking Scheme

Each of the following components will be marked on a 4-point scale and weighted.

| Component | Weight |
|-----------|--------|
| Report: model development and experimentation | 20% |
| Report: reflections | 20% |
| Report: abstract and appendices | 20% |
| Model: load and run on Colab | 20% |
| Model: performance (highest accuracy model) | 10% |
| Model: performance / parameters ratio (size optimized) | 10% |

| Score | Description |
|-------|-------------|
| 4 | Excellent - thoughtful and creative without any errors or omissions |
| 3 | Pretty good, but with minor errors or omissions |
| 2 | Mostly complete, but with major errors or omissions, lacking in detail |
| 1 | A minimal effort was made, incomplete or incorrect |
| 0 | No effort was made, or the submission is plagiarized |

Assignment 3: Classification of Text Data

Due ~~March 20~~ March 27, 2026

You may work in teams up to 3. Click here to create your team on GitHub Classroom.

Overview

The purpose of this assignment is to further your hands-on experience in neural networks with a new type of data: text!

Dataset and task

Download the pre-split “Reddit jokes” csv here (200 MB). You will still need to do a validation split, but I’ve reserved a final test set as with the previous assignments.

The original dataset is 1 million reddit jokes, but I’ve modified it a bit. I added a threshold to the score value to classify each joke as is_funny = True or is_funny=False, and I’ve also removed some redundant columns. I left the score value in there so you can see just how “funny” people thought a given joke was, but the score cannot be used as a predictor.

Your task is to predict whether or not a joke is funny. You may use any of the information in the dataset except for the score column to predict the is_funny value.

Warning

I have not vetted this dataset for appropriateness. While the “not safe for work” label is False for all entries, there is still likely to be the usual degeneration one encounters on Reddit.

Many of the posts have [removed], [deleted], and NaN for the body text field. This is still information! Aside from replacing NaN with an empty string, you probably want to keep these features.

Deliverables

Your repo should have the following:

  1. Your notebook(s) and/or Python scripts where you did your experiments, with the final training run and evaluation rendered
  2. A report describing your experiments and your final model decisions
  3. An inference script with a classify function that loads your model, performs any preprocessing necessary on the list of titles given as inputs, then returns a list of booleans containing the predicted funniness. If your model file(s) is too big for GitHub, you may share a link to it on Google Drive.

Resources

GPU resources are a challenge, and potentially more important with text data. Here are a few options to consider:

  • Kaggle provides 30 hrs/week of free GPU usage
  • Colab provides a “pay as you go” tier at $14/month. I hate asking students to pay for things, but think of it like the cost of a textbook. I’m testing out the pay as you go option, and it seems to use up credits at roughly $1.50/hr on a T4 GPU.

Your training notebook

I’ve included some sample code in the starter.ipynb to get you started. If you’re not using Colab and/or PyTorch, you might need to massage things a bit more (yes, this ends up being an inordinate amount of time in any data project).

Your inference script

This time, I’m asking that you implement an inference script similar to assignment 1 in prod.py. This should handle your data processing, model class definitions, loading the model weights, etc. You can assume that I will be loading the model weights from the working directory; please use relative paths to your weights file.

Your report

In a separate document, summarize your experiments, models, observations, reflections, etc. I’ve provided a template with more details in the starter code (report.md), though as usual you aren’t limited to the markdown format.

Marking Scheme

I’ll be marking each component on the usual 4-point scale and weighting as follows:

| Component | Weight |
|-----------|--------|
| Model development and experimentation process | 30% |
| Model performance and compatibility | 20% |
| Report: reflections | 30% |
| Report: other stuff | 20% |

As usual, the raw performance doesn’t matter too much provided it behaves “okay”.

| Score | Description |
|-------|-------------|
| 4 | Excellent - thoughtful and creative without any errors or omissions |
| 3 | Pretty good, but with minor errors or omissions |
| 2 | Mostly complete, but with major errors or omissions, lacking in detail |
| 1 | A minimal effort was made, incomplete or incorrect |
| 0 | No effort was made, or the submission is plagiarized |

Journal Club

Your 10% journal club mark consists of two components:

  • Reading each paper and participating in weekly discussions (5%)
  • Reading a single paper more thoroughly and presenting that paper (5%)

The signup sheet is on D2L. You will need to be logged in to your MRU Google account to access it.

Presentation

The week you have signed up to present, read the paper and prepare a (maximum) 10 minute presentation. You may use whatever file format you wish, and feel free to copy and paste relevant diagrams or quotes from the paper (a single citation at the start suffices for this purpose). If you include material from other resources, these must be cited as usual.

When creating your presentation, consider the following guiding questions. You don’t need to use these headings specifically, and not every question applies to every paper, but it’s a place to start.

Paper Summary

  • What are the main contributions of this work?
  • What is the context (timeframe, prior knowledge, etc) of this paper?
  • How did the authors justify their conclusions?

Paper’s influence

With the power of hindsight, we can look back and see what impact each paper had on the field of machine learning. Most of them are widely cited. Consider the following:

  • If applicable, try to find implementations of the algorithms described by the paper. These may be shared by the authors (for more recent papers), or implemented by others (for pre-2000s papers)
  • Are the methods described in the paper still relevant today?
  • What is your impression of the impact that this paper had on the field of machine learning?

Your opinions

  • What did you like about the paper?
  • What was the most confusing or challenging to understand?
  • Any additional thoughts?

Rubric

Each of the three sections will be assessed on a 4-point scale, with an additional score for sharing your thoughts and responding to classmates’ questions during the discussion period.

| Component | Score out of 4 |
|-----------|----------------|
| Paper summary | |
| Paper’s influence | |
| Your opinions | |
| Discussion | |
| **Total** | out of 16 |

| Score | Description |
|-------|-------------|
| 4 | Excellent - thoughtful and creative without any errors or omissions |
| 3 | Pretty good, but with minor errors or omissions |
| 2 | Mostly complete, but with major errors or omissions, lacking in detail |
| 1 | A minimal effort was made, incomplete or incorrect |
| 0 | No effort was made, or the submission is plagiarized |

Final Project: Predict something with some data

| | Deliverable | Due date | Weight (of project) |
|---|------------|----------|---------------------|
| 1 | Proposal | Feb 27 (pitch presentation in lab on March 2) | 15% |
| 2 | Checkpoint 1 | March 27 | 20% |
| 3 | Checkpoint 2 | April 13 | 20% |
| 4 | Final Report | April 24 (presentations on April 22*) | 45% |

*I asked the registrar to schedule a block of time during final exams for you to share the outcomes of your projects, so this shows up on MyMRU as a final exam. You can put finishing touches on your report in the days afterwards.

You may work in teams of 2 or 3.

Overview

The goal of the final project is to showcase the skills you’ve learned in this course by applying them to a new dataset and task. You are free to choose anything you like, but do not pick a “hello-world” style dataset. I’d rather you take a risk and have it not be successful than recreate yet another Titanic survival prediction model.

Overall, throughout this project you will:

  1. Choose a task, such as classification, regression, time-series forecasting, NLP, etc.
  2. Find an appropriate dataset for that task. This is harder than it sounds, and may require some iterating when you start working with the data.
  3. Explore the data and preprocess as necessary to handle missing values, feature scaling, etc.
  4. Implement a model to accomplish your task.
  5. Write a formal report describing your project.
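As a minimal illustration of step 3, here is a stdlib-only toy sketch of two common preprocessing operations, mean imputation and z-score feature scaling. In a real project you would more likely reach for pandas and scikit-learn (e.g. `SimpleImputer` and `StandardScaler`), but the underlying arithmetic is the same:

```python
from statistics import mean, pstdev

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

def standardize(values):
    """Scale values to zero mean and unit variance (z-scores)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

# A toy feature column with one missing value
heights = [150.0, None, 170.0, 160.0]
filled = impute_mean(heights)   # the missing entry becomes the mean, 160.0
scaled = standardize(filled)    # now zero mean and unit variance
```

Note that in practice you should compute imputation and scaling statistics on the training split only, then apply them to the validation/test splits, to avoid leaking information.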

Tip

Often, while writing a report and describing your methodology, you realize there is an error in that methodology and have to go back and fix something. Don't leave the writeup to the very end - it's a good idea to write as you go along.

Finding a Dataset

First, think of your interests and a possible prediction task, then look to see if there's a relevant publicly accessible dataset. As you poke around, you might do it the other way and stumble across an interesting dataset that inspires a task - that's okay too!

Some good places to look for datasets:

  • Hugging Face - A popular data- and model-sharing site, and the successor to the now-defunct “Papers with Code” (RIP)
  • Google Dataset Search - of course Google does datasets as well.
  • Wikipedia - a giant list of datasets for machine learning research.
  • Google BigQuery Datasets - we’ll be using this in assignment 3, so I’ll provide guidance on accessing it
  • Kaggle - a website for machine learning competitions with a huge number of datasets. You can filter by file size, topic and more

Submissions

This time, there is no starter code to share, as each of your projects will be different. Please submit your proposal, checkpoints, and report as PDFs on D2L, and include a link to your code (e.g. on GitHub or Google Drive) in your final report abstract. This code may be public so you can use it as a portfolio piece, or if you prefer to keep it private, you can invite me to view it on GitHub.

4-Point Scale

Each of the components will be marked on various criteria using the usual 4-point scale. More details are provided in each component description.

| Score | Description |
| --- | --- |
| 4 | Excellent - thoughtful and creative without any errors or omissions |
| 3 | Pretty good, but with minor errors or omissions |
| 2 | Mostly complete, but with major errors or omissions, lacking in detail |
| 1 | A minimal effort was made, incomplete or incorrect |
| 0 | No effort was made, or the submission is plagiarized |

Project Proposal

Your project proposal should be fairly concise, 1-2 pages consisting of the following sections.

Abstract

A 1-paragraph summary of what you intend to do and with what data.

Project Plan

Using words, tables, images, etc., describe:

  • The goal of your project
  • The dataset you intend to use
  • The approach you plan to take, referring to at least one paper

Risk Assessment and Backup Plan

Scoping a project is very challenging, and generally we want to do more than can be accomplished in the provided time. In this section, identify:

  • Challenges you anticipate, and potential solutions/mitigations (this may be best presented as a table)
  • A reduced scope goal that will still constitute project success if your primary objective can’t be met

References

Include a link to your proposed dataset and cite at least one technical paper describing the approach you plan to use.

Pitch Presentation

Prepare a very short presentation (1-2 slides, 3-4 minutes) to "pitch" your project during the lab on Monday, March 2. You can highlight whatever part is the most exciting to you; the main goal is to share with the class and learn who is working on similar projects.

Marking Scheme

Using the usual 4-point scale, I will be assessing the following equally-weighted categories:

  • Description of task and dataset
  • Thoughtful risk assessment and contingency plan
  • Appropriate references and tentative approach
  • Pitch presentation

For a total of 16 points.

Checkpoints

At this point, you should have begun some preliminary data exploration and preprocessing. You may have also begun building your model pipeline. Your checkpoint report should again be 1-2 pages with the following sections:

Wins

What has been working so far? What are you excited about? Share your good news! This would be a good place to show some preliminary results or interesting data discoveries.

Challenges

What challenges (foreseen or otherwise) have you encountered? Is there anything in particular that you need help to get past?

Remaining work

What do you have left to do on your project? Is it feasible within the time constraints? If not, how do you plan to revise your project plan?

Project revision (if needed)

As a result of your progress to date, do you anticipate any changes to your project plan? It’s okay if it’s a significant change - perhaps you discovered a better dataset, or came up with a different approach.

If you do have major changes, please cite the relevant sources for your new revised project.

References

This section may not be necessary if you don’t have significant changes.

Marking Scheme

Using the usual 4 point scale, I will be assessing the following categories:

  • Reasonable amount of progress made on project
  • Description of challenges and assessment of progress
  • Remaining work and revision (if needed)
  • Overall coherence and clarity

For a total of 16 points.