
Welcome!

This site contains lecture notes, tutorials, and assignment instructions for COMP 4630: Machine Learning. Use the sidebar on the left to view the material, and if you notice a problem, please report an issue at the GitHub repo.

Sometimes I’ll write stuff on the board. You can find a version of my scribbles here, which I’ll try to keep more or less up to date.

Lecture 1: Data Exploration

HTML Slides | PDF Slides

Welcome to Machine Learning!

COMP 4630 | Winter 2026 Charlotte Curtis


What is this course about?

  • Continuing the supervised/unsupervised learning algorithms from COMP 3652, with a focus on Neural Networks
  • First half: the history, theory, and math behind neural networks
  • Second half: applications of NNs in computer vision, natural language processing, and more

This is not (just) a course on building models using libraries like TensorFlow or PyTorch, it is a course on understanding the theory


How did I get involved with ML?



What do you want to learn about ML?


Grade Assessment

| Component | Weight |
|---|---|
| Assignments | 3 × 10% |
| Midterm (theory) exam | 20% |
| Journal club | 10% |
| Final project | 40% |

Bonus marks may be awarded for substantial corrections to materials, submitted as pull requests

Course materials repo: https://github.com/mru-comp4630/w26 Rendered at: https://mru-comp4630.github.io/w26/


Textbooks and other readings

Primary Textbook:

More mathy details:

Journal club list: on D2L under “Course Info” (requires MRU library login)


Generative AI policy

  • Yes, AI can do a lot of what I’m asking for in this course
  • No, I do not want to read about what AI “thinks”
  • ❓ What do you think is an appropriate use?

Machine Learning Project Checklist

Appendix A of the hands-on textbook

  1. Frame the problem and look at the big picture.
  2. Get the data.
  3. Explore the data to gain insights.
  4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
  5. Explore many different models and short-list the best ones.
  6. Fine-tune your models and combine them into a great solution.
  7. Present your solution.
  8. Launch, monitor, and maintain your system.
Bold stuff is what we'll discuss in this topic

1. Look at the big picture

Example Dataset: California housing prices (1990)

❓ Discussion questions:

  • How does the company expect to use and benefit from this model?
  • What is the current solution?
  • What kind of ML task is this?
  • What kind of performance measure should we use?

Where we left off on Wednesday, January 7

First, some stuff about assessments


2. Get the data

For this class, we’ll use readily available datasets. Some sources are:

After fetching the data, set aside a test set and don’t look at it.

“Get the data” can often be a huge task in itself!


2a. Set aside a test set

❓ Discussion questions:

  • Why do we need an independent test set?
  • Why would we use a random seed?
  • What is naive about simply selecting a random sample?
  • What else could we do?
  • What is stratified sampling?

Side tangent: Sampling bias

  • Simple example: assume 80% of population likes cilantro
  • Goal: ensure our sample is representative of the population,

The binomial distribution can be used to model the probability of choosing $k$ people who like cilantro from $n$ total participants:


Side tangent: Sampling bias continued

$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$ is the probability mass function, and the corresponding cumulative distribution function is just the sum up to $k$:

$P(X \le k) = \sum_{i=0}^{k} \binom{n}{i} p^i (1-p)^{n-i}$

Suppose we randomly sample 100 people. What is the probability of fewer than 75 or more than 85 cilantro lovers?

This is also my excuse to review some probability theory and notation
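The cilantro question can be answered numerically. A minimal stdlib-only sketch (function and variable names are mine), summing the binomial pmf over both tails:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 100, 0.8  # sample 100 people; 80% of the population likes cilantro

# P(X < 75): sum the pmf over k = 0..74
p_low = sum(binom_pmf(k, n, p) for k in range(75))
# P(X > 85): sum the pmf over k = 86..100
p_high = sum(binom_pmf(k, n, p) for k in range(86, 101))

print(f"P(X < 75 or X > 85) = {p_low + p_high:.4f}")
```

With a mean of 80 and a standard deviation of 4, both tails sit a bit beyond one standard deviation, so the combined probability is well under a half but far from negligible.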


3. Explore the data

❓ Discussion questions:

  • What do you notice about the data?
  • Do the values make sense for the labels?
  • Is the scale of the features comparable? Does this matter?
  • What possible biases might be present in the data?

3a. Look for correlations

The Pearson correlation coefficient is a measure of the linear correlation between two variables $x$ and $y$ (commonly denoted as $r$):

$r = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$

where $\bar{x}$ and $\bar{y}$ are the sample means of $x$ and $y$, respectively.

  • What do correlations of 0, 1, and -1 mean?
  • What are some limitations of Pearson correlation?
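The limitations are easy to see by computing $r$ directly. A small stdlib-only sketch (function name and data are mine):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r: covariance of x and y over the product of their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2 * xi + 1 for xi in x]))  # perfectly linear: r = 1
print(pearson_r(x, [-3 * xi for xi in x]))     # perfectly anti-linear: r = -1
print(pearson_r(x, [xi ** 2 for xi in x]))     # nonlinear but monotonic: |r| < 1
```

The last case illustrates a key limitation: a strong nonlinear relationship still yields $|r| < 1$, and a symmetric one (e.g. $y = x^2$ centred on zero) can yield $r \approx 0$.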

Where we left off on Monday, January 12


4. Prepare the data

General goals:

  • Handle missing data, and maybe outliers
  • Drop irrelevant features
  • Combine features using domain knowledge
  • Apply various transformations (e.g. scaling, encoding)
  • Apply scaling when necessary

4a. Handling missing data

The book lists three options for handling the NaN values:

housing.dropna(subset=["total_bedrooms"], inplace=True)  # option 1: drop the affected rows
housing.drop("total_bedrooms", axis=1)                   # option 2: drop the whole column
median = housing["total_bedrooms"].median()              # option 3: impute the median
housing["total_bedrooms"].fillna(median, inplace=True)

❓ Discussion questions:

  • What is each option doing?
  • What are the pros and cons of each option?
  • Which one should we choose?

4b. Handling non-numeric data

Most of the math in ML algorithms is based on numbers, so we need to convert text and categorical attributes to numbers. This is called encoding.

❓ Discussion questions:

  • Which columns of our data are categorical?
  • What methods could we use to convert them to numbers?
  • What are the assumptions about the various encoding methods?

4c. Scaling the data

Many ML algorithms don’t like features with vastly different scales. Common scaling methods are min-max scaling and standardization.

Important: scaling is computed on the training set and applied to the validation and test sets - they are not scaled independently!

❓ Discussion questions:

  • What are the bounds of each method?
  • Which method is more affected by outliers?
  • How would you decide which method to use?
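The two methods can be compared side by side. A stdlib-only sketch (function names and data are mine, not from scikit-learn), which also shows why the training-set statistics must be reused:

```python
def minmax_scale(train, new):
    """Min-max scaling to [0, 1], using the *training* min/max for new data."""
    lo, hi = min(train), max(train)
    return [(v - lo) / (hi - lo) for v in new]

def standardize(train, new):
    """Zero mean, unit variance, using the *training* mean/std for new data."""
    n = len(train)
    mean = sum(train) / n
    std = (sum((v - mean) ** 2 for v in train) / n) ** 0.5
    return [(v - mean) / std for v in new]

train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]  # note: 10 falls outside the training range
print(minmax_scale(train, test))  # scaled values can exceed [0, 1] for unseen data
print(standardize(train, test))
```

Note how the test value of 10 maps outside $[0, 1]$ under min-max scaling: the bounds only hold for data seen during training.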

4e. Standardization details

A general Gaussian distribution is given by:

$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$

where $\mu$ is the mean and $\sigma$ is the standard deviation. The standard normal distribution is a special case where $\mu = 0$ and $\sigma = 1$, reducing the equation to:

$f(x) = \dfrac{1}{\sqrt{2\pi}} e^{-x^2/2}$


4f. Other transformations

  • Log transformation: useful for data that is heavily skewed
  • Also square root, squaring, etc.: try to remove heavy tails
  • Feature engineering: combining features to create new ones
  • Binning: turning continuous data into discrete categories
    • Possibly using K-means clustering
    • Relies on domain knowledge
  • Best to create a transformation pipeline and apply it to the data rather than saving the transformed data

Coming up next

  • Math review:
    • Linear algebra
    • Differential calculus
    • Statistics
  • A brief introduction to vector calculus

Lecture 2: Math Review

HTML Slides | PDF Slides

Math review

COMP 4630 | Winter 2026 Charlotte Curtis


Math review

  • MATH 1203: Linear algebra
  • MATH 1200: Differential calculus
  • MATH 2234: Statistics

Further reading:


Linear algebra

Vectors are multidimensional quantities (unlike scalars):

A common vector space is $\mathbb{R}^2$, or the 2D Euclidean plane. Example:


Vector operations

  • Addition:
  • Scalar multiplication:
  • Dot product: (yields a scalar)
    • Can be thought of as the projection of one vector onto another, or how much two vectors are aligned in the same direction

Vector norms

  • The norm of a vector is a measure of its length
  • Most common is the Euclidean norm (or $\ell_2$ norm): $\|\mathbf{v}\|_2 = \sqrt{\sum_i v_i^2}$
  • You might also see the $\ell_1$ norm, particularly as a regularization term: $\|\mathbf{v}\|_1 = \sum_i |v_i|$
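The dot product and both norms are one-liners in plain Python. A quick sketch (helper names are mine):

```python
from math import sqrt

def dot(u, v):
    """Dot product: sum of elementwise products (yields a scalar)."""
    return sum(ui * vi for ui, vi in zip(u, v))

def l2_norm(v):
    """Euclidean (L2) norm: square root of the sum of squares."""
    return sqrt(dot(v, v))

def l1_norm(v):
    """L1 norm: sum of absolute values."""
    return sum(abs(vi) for vi in v)

v = [3, 4]
print(l2_norm(v))           # 5.0 (the 3-4-5 triangle)
print(l1_norm(v))           # 7
print(dot([1, 0], [0, 1]))  # 0: orthogonal vectors are not aligned at all
```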

Useful vectors

  • Unit vector: A vector with a norm of 1, e.g. ,
  • Normalized vector: A vector divided by its norm, e.g.
  • Dot product can also be written as

Yes, a normalized vector is also a unit vector, main difference is in context and notation


Matrices

A matrix is a 2D array of numbers:

Notation: Element $a_{ij}$ is in row $i$, column $j$, also written as $[A]_{ij}$.

Rows then columns! An $m \times n$ matrix has $m$ rows and $n$ columns


Matrix operations

  • Addition: element-wise if dimensions match.
  • Scalar multiplication: just like vectors
  • Matrix multiplication: $C = AB$, where the elements of $C$ are: $c_{ij} = \sum_k a_{ik} b_{kj}$
    • Multiply and sum rows of $A$ with columns of $B$
    • Usually, $AB \neq BA$

Matrix multiplication examples

Matrix times a matrix:

Matrix times a vector:


Where we left off on January 14


Matrix transpose

  • Transpose: swaps rows and columns

  • Inverse: just as $x x^{-1} = 1$, $A A^{-1} = I$, where $I$ is the identity matrix

Not every matrix is invertible!


Calculus: Notation

The derivative of a function $f(x)$ is represented as:

$f'(x)$ (Lagrange notation) or $\dfrac{df}{dx}$ (Leibniz notation)

The second derivative is denoted:

$f''(x)$ or $\dfrac{d^2 f}{dx^2}$

and so on.


Differentiability


For a function $f$ to be differentiable at a point $a$, it must be:

  • Defined at $a$
  • Continuous at $a$
  • Smooth at $a$
  • Non-vertical at $a$

Select rules of differentiation

| Function | Lagrange | Leibniz |
|---|---|---|
| Constant: $f(x) = c$ | $f'(x) = 0$ | $\frac{d}{dx}c = 0$ |
| Power: $f(x) = x^n$ with $n \neq 0$ | $f'(x) = nx^{n-1}$ | $\frac{d}{dx}x^n = nx^{n-1}$ |
| Sum | $(f + g)'(x) = f'(x) + g'(x)$ | $\frac{d}{dx}(f + g) = \frac{df}{dx} + \frac{dg}{dx}$ |
| Exponential: $f(x) = e^x$ | $f'(x) = e^x$ | $\frac{d}{dx}e^x = e^x$ |
| Chain Rule | $(f \circ g)'(x) = f'(g(x))\,g'(x)$ | $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$ |
This is the kind of thing I would not expect you to memorize on an exam

Chain rule example

  1. Find for

  2. Now, let, , where . What is ?


Partial derivatives

For a scalar valued function $f(x, y)$, there are two partial derivatives: $\dfrac{\partial f}{\partial x}$ and $\dfrac{\partial f}{\partial y}$

These are computed by holding the “other” variable(s) constant. For example, if $f(x, y) = x^2 y$, then:

$\dfrac{\partial f}{\partial x} = 2xy, \qquad \dfrac{\partial f}{\partial y} = x^2$


A brief introduction to vector calculus

Putting together partial derivatives with vectors and matrices we get:

Scalar-valued :

Vector-valued :

Most of the time we’ll just be working with the gradient


Statistics: Notation

  • A random variable is a variable that can take on random values according to some probability distribution
  • It may take on discrete (e.g. dice rolls) or continuous (e.g. age) values
  • $X$ or $Y$ for the random variable and $x$ or $y$ for a specific value
  • $P(X = x)$ for a discrete distribution and $p(x)$ for continuous
  • $\mathbb{E}[X]$ for expectation and $\text{Var}[X]$ for variance

Some textbooks/papers/websites use different notation!


Discrete random variables

  • A discrete probability mass function describes the probability of $X$ taking on a specific value
  • Example: for a balanced 6-sided die, $P(X = x) = \frac{1}{6}$ for $x \in \{1, 2, \ldots, 6\}$
  • You can add together probabilities, e.g. $P(X \le 2) = P(X = 1) + P(X = 2) = \frac{1}{3}$
  • $P(X = x) \ge 0$ and $\sum_x P(X = x) = 1$ for any valid distribution

Continuous random variables

  • A continuous probability density function $p(x)$ gives the probability of $X$ being in some tiny interval $[x, x + \delta]$ given by $p(x)\,\delta$
  • Example: the uniform distribution, $p(x) = \frac{1}{b - a}$ for $a \le x \le b$
  • $P(X = x) = 0$ for any specific value
  • Need to integrate to get a concrete value, e.g. $P(a \le X \le b) = \int_a^b p(x)\,dx$
  • $p(x) \ge 0$ and $\int_{-\infty}^{\infty} p(x)\,dx = 1$ for any valid distribution

Expectation and variance

  • The expectation or expected value of a random variable is its average value
  • $\mathbb{E}[X] = \sum_x x\,P(X = x)$ for discrete and $\mathbb{E}[X] = \int x\,p(x)\,dx$ for continuous
  • More generally, for any function $f$: $\mathbb{E}[f(X)] = \sum_x f(x)\,P(X = x)$
  • The variance describes how much the values vary from their mean: $\text{Var}[X] = \mathbb{E}\left[(X - \mathbb{E}[X])^2\right]$

Multiple random variables

  • Joint probability $P(X = x, Y = y)$ is the probability of $x$ and $y$ occurring together
  • Conditional probability $P(Y = y \mid X = x)$ is the probability that $Y$ takes on value $y$ given that $X = x$ has already happened
  • In general, $P(X, Y) = P(Y \mid X)\,P(X)$
  • For independent variables, $P(X, Y) = P(X)\,P(Y)$
Note: I'm using uppercase P here, but it all applies to continuous distributions as well

Covariance

  • The covariance between $X$ and $Y$ gives a sense of how linearly related they are and how much they vary together: $\text{Cov}(X, Y) = \mathbb{E}\left[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\right]$
  • Related to correlation as $r = \dfrac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$
  • The covariance matrix of a random vector is a square matrix where element $(i, j)$ is the covariance between the $i$th and $j$th elements
  • The diagonal of the covariance matrix gives the variances $\text{Var}[X_i]$

The Normal distribution

Good “default choice” for two reasons:

  • The central limit theorem shows that the sum of many independent random variables is approximately normally distributed
  • Has the maximum entropy (most uncertainty) of any distribution with the same variance

We can’t integrate the Gaussian density analytically, so numerical approximations are used




Coming up next

  • Training (regression) models
    • Linear regression
    • Gradient descent
  • References and suggested reading:

Lecture 3: Training models

HTML Slides | PDF Slides

Training Models with Regression and Gradient Descent

COMP 4630 | Winter 2026 Charlotte Curtis


Overview

  • Linear Regression and the Normal Equation
  • Gradient Descent and its various flavours
  • References and suggested reading:

Linear Regression

Unlike most models, linear regression has a closed-form solution called the Normal Equation:

$\hat{\boldsymbol{\theta}} = (X^\top X)^{-1} X^\top \mathbf{y}$

where

  • $\hat{\boldsymbol{\theta}}$ are the weights of the model minimizing the cost function
  • $\mathbf{y}$ is the vector of target values
  • $X$ is the design matrix of feature values

As usual, different sources use different notation, e.g. $\mathbf{w}$ or $\boldsymbol{\beta}$ instead of $\boldsymbol{\theta}$.


Consider the 1-d case:


we want the values of $m$ and $b$ that minimize the Mean Square Error between the actual values $y_i$ and predicted values $\hat{y}_i = m x_i + b$:

$\text{MSE} = \dfrac{1}{n} \sum_{i=1}^{n} \left(y_i - (m x_i + b)\right)^2$


Solving for $m$ and $b$

Math time!


Solving for $m$ and $b$


After some algebraic gymnastics, we get:

$m = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b = \bar{y} - m\bar{x}$

where $\bar{x}$ and $\bar{y}$ are the means of the $x$ and $y$ values, respectively.


Expanding to matrix form

Instead of the scalar $x$ or even vector $\mathbf{x}$, it’s common to use a design matrix $X$ to represent the feature values:

where each row is an instance (sample) and each column is a feature.

The first column is all ones, representing the bias term


Back to the linear regression problem…

  • We can rewrite the estimate in matrix notation:

    $\hat{\mathbf{y}} = X\boldsymbol{\theta}$

  • The MSE can be written as:

    $\text{MSE} = \dfrac{1}{n} (X\boldsymbol{\theta} - \mathbf{y})^\top (X\boldsymbol{\theta} - \mathbf{y})$

    where we’ve used the trick of substituting $x_0 = 1$ so the bias term folds into $\boldsymbol{\theta}$

:abacus: Find the gradient of the MSE w.r.t. $\boldsymbol{\theta}$, set it to zero, and solve for $\boldsymbol{\theta}$


Properties of matrices and their transpose

The following properties are useful for solving linear algebra problems:

Additionally, any matrix or vector multiplied by the identity matrix $I$ is unchanged.


The Normal Equation

We made it! The Normal Equation is again:

$\hat{\boldsymbol{\theta}} = (X^\top X)^{-1} X^\top \mathbf{y}$

  • No optimization is required to find the optimal $\boldsymbol{\theta}$
  • Limitations:
    • $X^\top X$ must be invertible and small enough to fit in memory
    • The computational complexity is (at least) $O(n^{2.4})$ in the number of features
  • Even in linear regression problems, it is common to use gradient descent instead due to these limitations
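As a sanity check, the Normal Equation can be worked out by hand for a single feature plus a bias column, where $X^\top X$ is only 2×2 and easy to invert. A stdlib-only sketch (function name and data are mine):

```python
def normal_equation_1d(x, y):
    """Fit y ~ theta0 + theta1 * x by solving (X^T X) theta = X^T y directly.
    With X = [[1, x_i]], the 2x2 system can be inverted in closed form."""
    n = len(x)
    # entries of X^T X and X^T y
    sx, sxx = sum(x), sum(xi * xi for xi in x)
    sy, sxy = sum(y), sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx  # must be nonzero, i.e. X^T X invertible
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (n * sxy - sx * sy) / det
    return theta0, theta1

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]         # exactly y = 1 + 2x
print(normal_equation_1d(x, y))  # recovers (1.0, 2.0)
```

The `det` check is the invertibility limitation in miniature: if all $x_i$ are equal, the determinant is zero and no unique solution exists.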

Gradient Descent

The goal of gradient descent is still to minimize the cost function, but it follows an iterative process:

  1. Start with a random $\boldsymbol{\theta}$

  2. Calculate the gradient $\nabla_{\boldsymbol{\theta}} \text{MSE}$ for the current $\boldsymbol{\theta}$

  3. Update as $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla_{\boldsymbol{\theta}} \text{MSE}$

  4. Repeat 2-3 until some stopping criterion is met

    where $\eta$ is the learning rate, or the size of step to take in the direction opposite the gradient.
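The steps above can be sketched in plain Python for the 1-D linear model (toy data and names are mine; real code would use NumPy or a framework):

```python
def gradient_descent(x, y, eta=0.05, epochs=2000):
    """Batch gradient descent for y ~ theta0 + theta1 * x, minimizing MSE."""
    n = len(x)
    theta0 = theta1 = 0.0  # "random" start; zero works fine for this convex problem
    for _ in range(epochs):
        # gradient of MSE = (1/n) * sum (pred - y)^2 w.r.t. each parameter
        err = [(theta0 + theta1 * xi) - yi for xi, yi in zip(x, y)]
        g0 = 2 / n * sum(err)
        g1 = 2 / n * sum(e * xi for e, xi in zip(err, x))
        theta0 -= eta * g0  # step opposite the gradient
        theta1 -= eta * g1
    return theta0, theta1

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]       # exactly y = 1 + 2x
print(gradient_descent(x, y))  # approaches (1.0, 2.0)
```

Because MSE is convex, this converges to the same solution the Normal Equation gives in one shot, provided $\eta$ is small enough to be stable.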


Stochastic Gradient Descent

  • Standard or batch gradient descent uses the entire training set to calculate the gradient at every step
  • Stochastic Gradient Descent uses a single random instance at each step:
    1. Start with a random $\boldsymbol{\theta}$
    2. Pick a random instance $\mathbf{x}_i$ (row in the design matrix)
    3. Calculate the gradient for the current $\boldsymbol{\theta}$ and $\mathbf{x}_i$
    4. Update as $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla_{\boldsymbol{\theta}} L_i$
    5. Repeat 2-4 until some stopping criterion is met

Mini-batch Gradient Descent


  • Mini-batch gradient descent uses a random subset of the training set
  • Less chaotic than stochastic, but faster than batch
  • Most common type of gradient descent used in practice

Gradient Descent Hyperparameters

  • The learning rate $\eta$ - size of step taken
  • No rule that it needs to be constant! A simple learning schedule is to decrease $\eta$ over time, e.g. $\eta_t = \frac{\eta_0}{1 + t/s}$, where $t$ is the current iteration and $\eta_0$ and $s$ are hyper-parameters
  • For mini-batch, the batch size is another hyper-parameter
  • The number of epochs, or times to process the entire training set

Stopping Criteria


  • The simplest stopping criterion is to set a maximum number of epochs
  • Early stopping is another option:
    • Evaluate on a validation set at regular intervals
    • Stop when the validation error starts to increase
  • The comparison between training and validation performance can also help prevent overfitting

Loss functions

  • The loss function is the function being minimized by gradient descent

  • MSE is convex and guaranteed to have a single global minimum, but many other loss functions have multiple local minima

  • The relative scale of the features can affect the convergence:


Figure from Scikit-Learn book

Higher-order Polynomials

  • Higher order polynomials can be solved with the Normal Equation as well:
  • Just include the higher order terms in
  • This is still a linear regression problem because the coefficients are linear!
  • Risk of overfitting the data
  • Easy way to regularize: drop one or more of the higher order terms

Regularization

  • If the model fits the training data too well, but doesn’t generalize to new data, it is overfitting
  • Regularization imposes additional constraints on the weights
  • Example: Ridge Regression adds a term to the loss function: where is the regularization parameter
  • The regularization term is only added during training, not evaluation

Note: the term cost function is often used instead of loss function


Logistic regression and beyond

Logistic regression is a binary classifier that uses the logistic function (aka sigmoid function) to map the output to a range of 0 to 1:

$\sigma(z) = \dfrac{1}{1 + e^{-z}}$

We can then minimize the log loss or cross-entropy loss function:

$L = -\dfrac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \right]$

where $\hat{p}_i$ is the probability that instance $i$ is positive.

More on cross-entropy loss when we talk about classification models

The gradient of the log loss ends up being:

$\nabla_{\boldsymbol{\theta}} L = \dfrac{1}{m} X^\top \left( \sigma(X\boldsymbol{\theta}) - \mathbf{y} \right)$

  • There is no (known) analytical solution this time, but we can still use gradient descent!
  • In this case it’s still convex, so we don’t have to worry about local minima
  • In general, for a loss function to work with gradient descent, it must be:
    • Continuous and
    • Differentiable
    • … at the locations where you evaluate it

Next up: Backpropagation!

Lecture 4: Backpropagation

HTML Slides | PDF Slides

Backpropagation

COMP 4630 | Winter 2026 Charlotte Curtis


Overview

  • A brief review of the history of neural networks
  • Neurons, perceptrons, and multilayer perceptrons
  • Backpropagation
  • References and suggested reading:

The rise and fall of neural networks

In between each era of excitement and advancement there was an “AI winter”


Model of a neuron

  • McCulloch and Pitts (1943)
  • Neuron as a logic gate with time delay
  • “Activates” when the sum of inputs exceeds a threshold
  • Non-invertible (forward propagation only)

Threshold Linear Units (TLUs)

  • Linear I/O instead of binary
  • Rosenblatt (1957) combined multiple TLUs in a single layer
  • Physical machine: the Mark I Perceptron, designed for image recognition
  • Criticized by Minsky and Papert (1969) for its inability to solve the XOR problem - first AI winter

A single threshold logic unit (TLU)

Image source: Scikit-learn book


Training a perceptron

  • Hebb’s rule: “neurons that fire together, wire together”

where $x$ = input, $y$ = output

  • Fed one instance at a time,

  • Guaranteed to converge if inputs are linearly separable

  • :abacus: Simple example: AND gate

A perceptron with two inputs and three outputs

Image source: Scikit-learn book


Multilayer perceptrons (MLPs)

  • If a perceptron can’t even solve XOR, how can it do higher order logic?

  • Consider that XOR can be rewritten as:

    A xor B = (A and !B) or (!A and B)
    
  • A perceptron can solve and and or and not… so what if the input to the or perceptron is the output of two and perceptrons?


A solution to XOR



Backpropagation

  • I just gave you the weights to solve XOR, but how do we actually find them?
  • Applying the perceptron learning rule no longer works, need to know how much to adjust each weight relative to the overall output error
  • Solution presented in 1986 by Rumelhart, Hinton, and Williams
  • Key insight: Good old chain rule! Plus some recursive efficiencies

Training MLPs with backpropagation

  1. Initialize the weights, through some random-ish strategy

  2. Perform a forward pass to compute the output of each neuron

  3. Compute the loss of the output layer (e.g. MSE)

  4. Calculate the gradient of the loss with respect to each weight

  5. Update the weights using gradient descent (minibatch, stochastic, etc)

  6. Repeat steps 2-5 until stopping criteria met

    Step 4 is the “backpropagation” part


Example: forward pass

  • With a linear activation function:
  • In summation notation for a single sample:
  • In this case,


and


Example: calculate error and gradient

  • We never picked a loss function! Let’s assume we’re using MSE
  • For a single sample: $L = \frac{1}{2}(\hat{y} - y)^2$, with the added $\frac{1}{2}$ for convenience
  • The goal is to update each weight by a small amount to minimize the loss
  • Fortunately, we know how to find a small change in a function with respect to one of the variables: the partial derivative!

Recursively applying the chain rule

  • Weights in the second layer (connecting hidden and output):

  • For the first layer (connecting inputs to hidden): where is the output of the hidden layer


Bias terms

  • The toy example did not include bias terms, but these are very important (as seen in the perceptron examples)
  • With a single layer we can add a column of 1s to , but with multiple layers we need to add bias at every layer
  • The forward pass becomes:
  • Or in summation form:

Gradient with respect to the bias terms

  • For layer 2 (the output layer):

  • For layer 1:

    where is the input to the hidden layer


Summary in matrix form

| Parameter | Gradient |
|---|---|
| Weights of layer 2 | $\frac{\partial L}{\partial W_2} = (\hat{y} - y)\,\mathbf{h}^\top$ |
| Bias of layer 2 | $\frac{\partial L}{\partial b_2} = \hat{y} - y$ |
| Weights of layer 1 | $\frac{\partial L}{\partial W_1} = \left(W_2^\top (\hat{y} - y)\right)\mathbf{x}^\top$ |
| Bias of layer 1 | $\frac{\partial L}{\partial b_1} = W_2^\top (\hat{y} - y)$ |

Computational considerations

  • Many of the terms computed in the forward pass are reused in the backward pass (such as the inputs to each layer)

  • Similarly, gradients computed in layer are reused in layer

  • Typically each intermediate value is stored, but modern networks are big

    | Model | Parameters |
    |---|---|
    | Our example | 6 |
    | AlexNet (2012) | 60 million |
    | GPT-3 (2020) | 175 billion |

Choices in neural network design


Activation functions

  • The simple example used a linear activation function (identity)

  • To include other activation functions, the forward pass becomes:

  • The gradient in the output layer becomes: $\frac{\partial L}{\partial W_2} = (\hat{y} - y)\,\sigma'(z_2)\,\mathbf{h}^\top$, where $z_2$ is the summation of the second layer before applying the activation function

  • Problem! That step function in the original perceptron is not differentiable


Activation functions

  • A common early choice was the sigmoid function:
  • A more computationally efficient choice common today is the “ReLU” (Rectified Linear Unit) function:
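Both functions are one-liners, which is part of why ReLU is so cheap. A quick stdlib sketch:

```python
from math import exp

def sigmoid(z):
    """Squashes any real input into (0, 1); the gradient vanishes for large |z|."""
    return 1.0 / (1.0 + exp(-z))

def relu(z):
    """Rectified Linear Unit: max(0, z). Gradient is 1 for z > 0 and 0 for z < 0."""
    return max(0.0, z)

print(sigmoid(0))         # 0.5
print(sigmoid(10))        # ~1: the flat region where sigmoid gradients vanish
print(relu(-3), relu(3))  # 0.0 3
```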

Activation functions in hidden layers

The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles. – Deep Learning Book, Section 6.3

  • Activation functions in hidden layers serve to introduce nonlinearity
  • Common for multiple hidden layers to use the same activation function
  • Sigmoid, ReLU, and tanh (hyperbolic tangent) are common choices
  • Also “leaky” ReLU, Parameterized ReLU, absolute value, etc
  • Can be considered a hyperparameter of the network

Loss functions

  • The choice of loss function is very important!
  • Depends on the task at hand, e.g.:
    • Regression: MSE, MAE, etc
    • Classification: Usually some kind of cross-entropy (log likelihood)
  • May or may not include regularization terms
  • Must be differentiable, just like the activation functions

Activation functions in the output layer

  • Activation functions in the output layer should be chosen based on the loss function (and thus the task)
    • Regression: linear
    • Binary classification: sigmoid
    • Multiclass classification: softmax (generalization of sigmoid)
  • Again, must be differentiable

A complete fully connected network



Next up: Classification loss functions and metrics

Lecture 5: Classification

HTML Slides | PDF Slides

Classification loss functions and metrics

COMP 4630 | Winter 2026 Charlotte Curtis


Overview

  • All the derivation thus far has been for mean squared error
  • Cross-entropy loss is more appropriate for classification problems
  • References and suggested reading:

Revisiting the expected value

The expected value of some function $f(x)$ when $x$ is distributed as $P(x)$ is given in discrete form as:

$\mathbb{E}_{x \sim P}[f(x)] = \sum_x P(x)\,f(x)$

where the sum is over all possible values of $x$.

In continuous form, this is an integral:

$\mathbb{E}_{x \sim p}[f(x)] = \int p(x)\,f(x)\,dx$

Binary case: Bernoulli distribution

  • If a random variable $x$ has a probability $p$ of being 1 and a probability $1 - p$ of being 0, then $x$ is distributed as a Bernoulli distribution: $P(x) = p^x (1 - p)^{1 - x}$
  • The expected value of $x$ is then: $\mathbb{E}[x] = 1 \cdot p + 0 \cdot (1 - p) = p$

Information theory

Originally developed for message communication, with the intuition that less likely events carry more information, defined for a single event as:

$I(x) = -\log P(x)$


Entropy


  • We can measure the expected information of a distribution as: $H(P) = \mathbb{E}[I(x)] = -\sum_x P(x) \log P(x)$
  • This is called the Shannon entropy
  • Measured in bits (base 2) or nats (base $e$)
  • :abacus: Find the entropy of a Bernoulli distribution
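The abacus exercise can be checked numerically. A small stdlib sketch (function name is mine), working in bits:

```python
from math import log2

def bernoulli_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli(p) random variable."""
    if p in (0.0, 1.0):
        return 0.0  # the outcome is certain: zero information
    return -(p * log2(p) + (1 - p) * log2(1 - p))

print(bernoulli_entropy(0.5))  # 1.0 bit: a fair coin is maximally uncertain
print(bernoulli_entropy(0.9))  # a biased coin is less uncertain
print(bernoulli_entropy(1.0))  # 0.0: no surprise at all
```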

Cross-entropy

  • The KL divergence is a measure of the extra information needed to encode a message from a true distribution $P$ using an approximate distribution $Q$: $D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$
  • The cross-entropy is a simplification that drops the term $\sum_x P(x) \log P(x)$: $H(P, Q) = -\sum_x P(x) \log Q(x)$
  • Minimizing the cross-entropy is equivalent to minimizing the KL divergence
  • If $P = Q$, then $D_{KL}(P \| Q) = 0$ and $H(P, Q) = H(P)$

Cross-entropy loss


For a true label $y \in \{0, 1\}$ and predicted probability $\hat{p}$, the cross-entropy loss is:

$L = -\left[ y \log \hat{p} + (1 - y) \log(1 - \hat{p}) \right]$

where $\hat{p}$ is the output of the final layer of a neural network (thresholded to obtain the prediction $\hat{y}$)

This is also called log loss or binary cross-entropy
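The behaviour of log loss is easiest to see numerically. A stdlib-only sketch (function name and the `eps` clipping guard are mine):

```python
from math import log

def binary_cross_entropy(y, p_hat, eps=1e-12):
    """Log loss for a true label y in {0, 1} and predicted probability p_hat.
    eps clips the prediction away from 0 and 1 to avoid log(0)."""
    p_hat = min(max(p_hat, eps), 1 - eps)
    return -(y * log(p_hat) + (1 - y) * log(1 - p_hat))

print(binary_cross_entropy(1, 0.9))  # small loss: confident and correct
print(binary_cross_entropy(1, 0.1))  # large loss: confident and wrong
print(binary_cross_entropy(0, 0.1))  # small loss again
```

Note the asymmetry in magnitude: a confidently wrong prediction is penalized far more heavily than a confidently right one is rewarded, which is exactly the pressure that drives the model toward calibrated probabilities.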


Terminology for evaluation

  • True positive: predicted positive, label was positive ($\hat{y} = 1$, $y = 1$) ✔️

  • True negative: predicted negative, label was negative ($\hat{y} = 0$, $y = 0$) ✔️

  • False positive: predicted positive, label was negative ($\hat{y} = 1$, $y = 0$) ❌ (type I)

  • False negative: predicted negative, label was positive ($\hat{y} = 0$, $y = 1$) ❌ (type II)

  • Accuracy is the fraction of correct predictions, given as:

    $\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$


Precision and recall

  • Precision: Out of all the positive predictions, how many were correct? $\text{Precision} = \dfrac{TP}{TP + FP}$

  • Recall: Out of all the positive labels, how many were correct? $\text{Recall} = \dfrac{TP}{TP + FN}$

  • Specificity: Out of all the negative labels, how many were correct? $\text{Specificity} = \dfrac{TN}{TN + FP}$


Confusion matrix

|  | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |

  • The axes might be reversed, but a good predictor will have strong diagonals
  • There’s also the F1 score, or harmonic mean of precision and recall: $F_1 = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
Check out the Wikipedia page for more ways of describing the same information
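All of these metrics fall out of the four confusion-matrix counts. A small sketch (function name and the counts are mine, purely hypothetical):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common evaluation metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of predicted positives, how many were right?
    recall = tp / (tp + fn)     # of actual positives, how many were found?
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# hypothetical counts: 80 TP, 10 FP, 20 FN, 90 TN
acc, prec, rec, f1 = classification_metrics(80, 10, 20, 90)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

A production version would also guard against zero denominators (e.g. a classifier that never predicts positive makes precision undefined).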

ROC Curves

  • The receiver operating characteristic curve is a plot of the true positive rate (recall or sensitivity) vs. false positive rate (1 - specificity) as the detection threshold changes

  • The diagonal is the same as random guessing

  • A perfect classifier would hug the top left corner

Fun fact: the name comes from WWII radar operators, where true positives were airplanes and false positives were noise


Which classifier is better?



Multiclass case

  • For $K$ classes, the output is a vector $\hat{\mathbf{p}}$ with $\sum_{k=1}^{K} \hat{p}_k = 1$
  • The cross-entropy loss is then: $L = -\sum_{k=1}^{K} y_k \log \hat{p}_k$
  • For a one-hot encoded vector $\mathbf{y}$, this simplifies to: $L = -\log \hat{p}_c$, where $c$ is the index of the true class

The softmax function

  • For binary classification, the sigmoid function is used to predict the probability of the positive class

  • For multiclass classification, the softmax function is used:

    $\hat{p}_k = \dfrac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$

    where $z_k$ is the output of neuron $k$ in the final layer before the activation function is applied

  • This means that $K$ neurons are needed in the final layer, one for each class
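Softmax is a few lines of Python; the max-subtraction below is the usual numerical-stability trick (it doesn't change the result, since softmax is invariant to adding a constant to every score):

```python
from math import exp

def softmax(z):
    """Turn K raw scores into K probabilities that sum to 1."""
    m = max(z)                          # subtract the max for numerical stability
    exps = [exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # largest score gets the largest probability
print(sum(probs))  # 1.0
```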


Next up: Convolution and NN frameworks

Lecture 6: Modern Neural Networks

HTML Slides | PDF Slides

Intro to modern neural networks

COMP 4630 | Winter 2026 Charlotte Curtis


Overview


Revisiting Backpropagation

  • For a network with layers, the gradients of the loss function with respect to the weights in the last layer are given by:

    assuming that the output is a function of layer ’s input .

  • At layer , the gradients are computed as:


  • At layer , this becomes:

  • And so on, until we reach the first layer.

  • We are recursively applying the chain rule and re-using the gradients computed at the previous layer

  • This is great for computational efficiency, but it can also lead to vanishing or exploding gradients


Vanishing and Exploding Gradients

  • Vanishing/exploding gradients are where the gradients become near zero or near infinity as they are propagated back through the network
  • Particularly problematic for recurrent neural networks, where the same weights are multiplied by themselves repeatedly
  • Also a problem for very deep networks, and part of the reason that deep learning was not popular until the 2010s
  • ❓ What changed?

Consider the variance

  • At the input layer, , and has some variance
  • Assume and are initialized to 0
  • :abacus: What is the variance of ?
  • What about after the activation function ?

Initialization strategies

  • In 2010, Glorot and Bengio proposed the Xavier initialization for a layer with $n_{in}$ inputs and $n_{out}$ outputs: $\sigma^2 = \dfrac{2}{n_{in} + n_{out}}$
  • Goal is to preserve the variance of the input and output in both directions
  • Similar to LeCun initialization, and apparently an overlooked feature of networks from the 1990s

Initialization for ReLU

  • Glorot initialization was derived under the assumption of linear activation functions (even though they knew this wasn’t the case)
  • In 2015, He et al. proposed the He initialization specifically for ReLU activations: $\sigma^2 = \dfrac{2}{n_{in}}$
  • The choice of normal vs uniform is apparently not very important
  • Default in PyTorch is Kaiming (He) uniform

Batch normalization

  • Also in 2015, Ioffe and Szegedy proposed batch normalization as a way to mitigate vanishing/exploding gradients
  • This is simply a normalization at each layer, shifting and scaling the inputs to have a mean of 0 and a variance of 1 (across the batch)
  • A moving average of the mean and variance is maintained during training, and used for normalization during inference
  • It also ends up acting as regularization, magic!
  • ❓ Why wouldn’t you want to use batch normalization?

RELU and its variants

  • In early works, the sigmoid or tanh functions were popular
  • Both have a small range of non-zero gradients
  • ReLU has a stable gradient for positive inputs, but can lead to the dying ReLU problem whereby certain neurons are “turned off”
  • ❓ How can we prevent dying ReLUs?

Note: this may not be a problem, and ReLU is cheap. Don’t optimize prematurely unless you’re seeing lots of “dead” neurons.


Number of neurons and layers

  • Number of neurons in the input layer is defined by number of features
  • Number of neurons in the output layer is defined by prediction task
  • In between is a design choice
  • Common early choice was a pyramid shape, but it turns out that a stack of layers with the same number of neurons works well too
  • Deeper networks can solve more complex problems with the same number of total parameters, but are also prone to vanishing/exploding gradients

Ultimately, yet another a hyperparameter to be tuned


Optimization algorithms: variations on gradient descent

  • Gradient descent takes small regular steps, constant or otherwise
  • Many variations exist! For example, momentum keeps track of the previously computed gradient and uses it to inform the new step: $\mathbf{m} \leftarrow \beta \mathbf{m} - \eta \nabla_{\boldsymbol{\theta}} L$, then $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \mathbf{m}$, where $\beta$ is a hyperparameter between 0 and 1
  • Adaptive moment estimation (Adam) is a popular choice that adds on an exponentially decaying average of the squared gradients

The Adam optimizer

  • Keeps track of first ($\mathbf{m}$) and second ($\mathbf{v}$) moments of the gradient, with two exponential decay terms $\beta_1$ and $\beta_2$
  • At each time step, the update is now: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\, \dfrac{\hat{\mathbf{m}}}{\sqrt{\hat{\mathbf{v}}} + \epsilon}$
  • $\beta_1$ and $\beta_2$ are typically 0.9 and 0.999, respectively
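The Adam update fits in a short loop. A stdlib-only sketch for a single scalar parameter (function name, toy loss, and step count are mine), including the bias-correction terms from the original paper:

```python
from math import sqrt

def adam_minimize(grad, theta, eta=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Adam on one parameter: exponentially decaying averages of the gradient (m)
    and squared gradient (v), with bias correction for the zero-initialized EMAs."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g      # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g * g  # second moment (mean of squared gradients)
        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= eta * m_hat / (sqrt(v_hat) + eps)
    return theta

# toy loss L(x) = (x - 3)^2 with gradient 2(x - 3); minimum at x = 3
print(adam_minimize(lambda x: 2 * (x - 3), theta=0.0))  # approaches 3
```

Notice how the effective step size is roughly $\eta$ regardless of the raw gradient magnitude, because the first moment is divided by the square root of the second.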

Regularization via dropout

  • Dropout is a regularization technique that randomly sets a fraction of the neurons to zero during training
  • During each training pass, a neuron has a probability of being dropped
  • Similar to training an ensemble of models then bagging
  • Helps to prevent overfitting, but can slow down training
  • Typical values: 0.5 for hidden layers, 0.2 for input layers

Choices for starting

From the Deep Learning Book, section 11.2:

  • “Depending on the complexity of your problem, you may even want to begin without using deep learning.”
  • Tabular data: fully connected, images: convolutional, sequences: recurrent*
  • ReLU or variants, with He initialization
  • SGD or Adam, add batch normalization if unstable
  • Use some kind of regularization, such as dropout
*This book was written before transformer models

Implementation time

Lecture 7: Convolutional Neural Networks

HTML Slides | PDF Slides

Convolution and Convolutional Neural Networks

COMP 4630 | Winter 2026 Charlotte Curtis


Overview

  • Convolutional neural networks (CNNs) are a type of neural network that is particularly well-suited to image data
  • Before we can understand CNNs, we need to understand convolution
  • References and suggested reading:

Convolution

  • Convolution is defined as: $(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$

  • Or in the discrete case: $(f * g)[n] = \sum_{m} f[m]\, g[n - m]$

  • Can be thought of as “flipping” one function and sliding it over the other, multiplying and summing at each point


Example

We’re in a hospital dealing with an outbreak. For the first 5 days we have 1 patient on Monday, 2 on Tuesday, etc:

Fortunately, we know how to treat them: 3 doses on day 1, then 2, then 1:

And after 3 days they’re cured.

How many doses do we need on each day?

Hospital example taken from here
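The hospital example is exactly a discrete convolution, so NumPy can compute the daily dose totals directly:

```python
import numpy as np

patients = np.array([1, 2, 3, 4, 5])  # new patients on days 0..4
doses = np.array([3, 2, 1])           # doses per patient on their treatment days 1..3

daily_doses = np.convolve(patients, doses)
print(daily_doses)  # [ 3  8 14 20 26 14  5]
```

Each output day sums the doses owed to every cohort of patients still in treatment, which is the "slide, multiply, sum" picture above.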

Convolution in 2D

  • Extending to 2D basically adds another summation/integration: $(f * g)[m, n] = \sum_i \sum_j f[i, j]\, g[m - i, n - j]$
  • This can also be extended to higher dimensions
  • Caution: a colour image is a 3D array, not a 2D array
  • For typical image processing applications, the colour channels are convolved independently such that the output is still a 3D array

Convolution kernels

  • Typically there is a small kernel that is convolved with the input
  • This is just the smaller of the two functions in the convolution
  • ❓ What happens at the edges of the input?



Some common kernels

  • Averaging: $\frac{1}{9}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$
  • Differentiation (e.g. a horizontal difference): $\begin{bmatrix} -1 & 0 & 1 \end{bmatrix}$
  • Sizes are commonly chosen to be 3x3, 5x5, 7x7, etc.
  • ❓ Why divide by 9?
  • ❓ Why odd sizes?
  • ❓ What effect do you think these kernels will have on an image?
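A naive "valid" 2-D convolution makes the kernel mechanics concrete. This is purely illustrative; real libraries use far faster implementations:

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Naive 2-D 'valid' convolution (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    flipped = kernel[::-1, ::-1]  # true convolution flips the kernel
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * flipped)
    return out

avg = np.ones((3, 3)) / 9  # the 3x3 averaging kernel from above
img = np.arange(25, dtype=float).reshape(5, 5)
out = conv2d_valid(img, avg)
print(out.shape)  # (3, 3): a 5x5 input shrinks by kernel_size - 1 in each dimension
```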

A side tangent on frequency representation

  • Any signal can be represented as a weighted summation of sinusoids
  • For a discrete signal $x[n]$, you can think of this as: $x[n] = \sum_k a_k \cos(2\pi k n / N) + b_k \sin(2\pi k n / N)$
  • Or, using Euler’s formula $e^{j\theta} = \cos\theta + j\sin\theta$: $x[n] = \frac{1}{N}\sum_{k=0}^{N-1} X[k]\, e^{j 2\pi k n / N}$, where the complex coefficients $X[k]$ encode the amplitude and phase of each frequency
j is the same as i, but in engineering

Fourier Transform

  • To figure out what the coefficients are, we can use the Discrete Fourier Transform (DFT): $X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}$, where each element of $X$ is the coefficient for frequency $k$
  • The Fast Fourier Transform (FFT) computes the DFT in $O(N \log N)$ time
  • Convolution computed directly is $O(N^2)$

Convolution is multiplication in frequency

  • Sometimes it is useful to use the Fourier Transform to obtain a frequency representation of an image (or signal)
  • In the frequency domain, convolution is simply element-wise multiplication
  • This allows for some efficient operations such as blurring/sharpening an image, as well as some fancy stuff like deconvolution
  • Sharp edges in space become ringing in frequency, and vice versa
  • ❓ why do CNNs operate in the spatial domain?
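The "convolution is multiplication in frequency" claim can be checked numerically: zero-pad both signals to the full output length, multiply their FFTs, and invert. The signals here are arbitrary examples:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([1.0, 0.0, -1.0])

direct = np.convolve(x, k)          # direct linear convolution

n = len(x) + len(k) - 1             # length of the full linear convolution
via_fft = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)

assert np.allclose(direct, via_fft)  # same result, two routes
```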

Convolutional neural networks


  • 1958: Hubel and Wiesel experiment on cats and reveal the structure of the visual cortex
  • Determine that specific neurons react to specific features and receptive fields
  • Modelled in the “neocognitron” by Kunihiko Fukushima in 1980
  • LeCun’s work in the 1990s led to modern CNNs

Why CNNs?

  • A fully connected network has a 1:1 mapping of weights to inputs
  • Fine for MNIST (28x28) pixels, but quickly grows out of control
  • ❓ If you train a (fully connected) network on 100x100 images, how would you infer on 200x200 images?
  • ❓ What if an object is shifted, rotated, or flipped within the image?
Bluey image used without permission from bluey.tv

Convolutional layers

  • A convolution layer is a set of kernels whose weights are learned
  • Instead of the straight up weighted sum of inputs, the input image is convolved with the learned kernel(s)
  • The output is often referred to as a feature map
  • The dimensionality of the feature map is determined by the:
    • Size of the input image
    • Number of kernels
    • Padding (usually “same” or “valid”)
    • “Stride”, or shift of the kernel at each step

Dimensionality examples

| Input | Kernel | Stride | Padding | Output |
|-------|--------|--------|---------|--------|
|       |        | 1      | same    |        |
|       |        | 2      | same    |        |
|       |        | 1      | valid   | ???    |
  • The number of channels has no impact on the depth of the output: the number of kernels determines the depth of the output
  • The colour channels are convolved independently, then summed

Number of parameters example

  • Input:
  • Kernel:
  • Bias terms: 32
  • Total parameters:

While convolution only happens in 2D, the kernel can be thought of as a 3D volume - there’s a separate trainable kernel for each channel
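The parameter count follows from the kernel volume: one $k \times k$ slice per input channel, per output feature map, plus one bias per feature map. A sketch with illustrative sizes (3 input channels, 32 kernels of 3×3; these numbers are not necessarily the slide's example):

```python
in_channels, n_kernels, k = 3, 32, 3        # illustrative sizes

weights = k * k * in_channels * n_kernels   # each kernel is a k x k x in_channels volume
biases = n_kernels                          # one bias per output feature map
params = weights + biases
print(params)  # 3*3*3*32 + 32 = 896
```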


Pooling layers

Pooling layers are used to reduce the dimensionality of the feature maps (aka downsampling) by taking the maximum or average of a region


  • ❓ Why would we want to downsample?

Putting it all together


Backpropagating CNNs

  • After backpropagating through the fully connected head, we need to backpropagate through the:
    • Pooling layer
    • Convolution layer
  • The pooling layer has no learnable parameters, but it needs to remember which input was maximum (all the rest get a gradient of 0)
  • The convolution layer is trickier math, but it ends up needing another convolution - this time with the kernel transposed

LeNet (1998) vs AlexNet (2012)



Inception v1 aka GoogLeNet (2014)



  • Struggled with vanishing gradients
  • v2 introduced batch normalization

VGG (2014)


  • Simple architecture, but “very” deep (16 or 19 layers)
  • Fixed the convolution hyperparameters and focused on depth

ResNet (2015)


  • Key innovation: easier to learn “identity” functions ()
  • If a layer outputs 0, it doesn’t kill the gradient
  • Even deeper, e.g. ResNet-152

Transfer Learning

“If I have seen further, it is by standing on the shoulders of giants” – Isaac Newton

  • Transfer learning copy pastes a trained network into a new task
  • You can select which layers to keep, which to freeze, and which to re-train
  • You can also drop new layers on top of the old ones
  • Most of the time you want to freeze the early layers and add a new “head”

Data Augmentation

  • Garbage in, garbage out

  • We can artificially increase diversity with data augmentation:

    • Random crops, flips, rotations
    • Rescaling/resizing
    • Changing colours
  • AutoAugment does a bunch of this automatically


Next up: RNNs

Lecture 8: Recurrent Neural Networks

HTML Slides | PDF Slides

Recurrent Neural Networks

COMP 4630 | Winter 2026 Charlotte Curtis


Overview


Sequence data

  • So far we’ve been talking about images, tabular data, and other “static” data
  • ❓ What are some examples of sequence data?



Non-RNN Approaches


As usual, you don’t always need a deep learning solution :hammer:

  • ❓ What is an example of a “naive” approach?
  • ❓ What are some limitations of naive approaches?

Autoregressive Moving Average

  • Models to predict time series as a weighted sum of past values and past errors: $\hat{y}_t = \sum_{i=1}^{p} \alpha_i y_{t-i} + \sum_{j=1}^{q} \theta_j \epsilon_{t-j}$, where $\epsilon_t$ is the prediction error at step $t$
  • Key assumption: data is stationary (mean and variance don’t change)
  • ARIMA adds on “integration” or “differencing” to account for trends

ARIMA

  • Autoregressive parameter : How many steps back to average?
  • Moving average parameter : How many previous errors to average?
  • Integrative parameter : How many “differencing” rounds to perform before applying ARMA?

    Differencing $d$ times can be thought of as approximating a $d$th-order polynomial trend


  • ❓ Are there any obvious trends in the data?
  • ❓ What about non-obvious trends?
  • ❓ How might this dataset be treated differently from the previous one?


Feedforward vs recurrent networks

  • Feedforward: data flows in one direction (then backpropagated)
  • Recurrent: data can flow in loops
Figure from Scikit-learn textbook

Recurrent layers

  • The simplest recurrent layer has a single feedback connection: $h_t = \phi(W_x x_t + W_h h_{t-1} + b)$, where $\phi$ is the activation function and $W_x$ and $W_h$ are weight matrices
  • “Backpropagation through time” (BPTT) is exactly the same as regular backpropagation through the unrolled network
  • ❓ What kind of issues might arise during training?
  • ❓ What are some limitations of this approach?
  • ❓ How can we deal with $h_{t-1}$ for $t = 0$?
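The recurrence can be unrolled in a few lines of NumPy; the layer sizes, random weights, and inputs below are arbitrary, and the initial hidden state is simply zeros:

```python
import numpy as np

rng = np.random.default_rng(42)

n_features, n_hidden, seq_len = 3, 4, 5
Wx = rng.normal(size=(n_hidden, n_features)) * 0.1  # input-to-hidden weights
Wh = rng.normal(size=(n_hidden, n_hidden)) * 0.1    # hidden-to-hidden weights
b = np.zeros(n_hidden)

h = np.zeros(n_hidden)                  # h_{-1}: start from zeros
xs = rng.normal(size=(seq_len, n_features))
for x in xs:
    h = np.tanh(Wx @ x + Wh @ h + b)    # h_t = tanh(Wx x_t + Wh h_{t-1} + b)

print(h.shape)  # (4,): one hidden state vector, reused at every time step
```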

Preparing data for RNNs

  • The data format depends on the task, e.g. do you want to predict:
    • The next value in a sequence (e.g. predictive text)
    • The next values in a sequence (e.g. stock prices)
    • The next sequence in a set of sequences (e.g. language translation)
  • Let’s start with predicting the next value in a sequence



Activation Functions for RNNs

  • The default activation function in tensorflow/PyTorch is tanh
  • ❓ What is different about RNNs that might influence the choice of activation function?
  • ❓ How might we normalize sequence data?

Beyond the “next value”

  • Option 1: Use the single-prediction RNN repeatedly
  • Option 2: Train the RNN to predict multiple values at once
    • Easy change model-wise, but data preparation is trickier
    • n inputs, n outputs
  • Option 3: Use a “sequence to sequence” model
    • Even trickier data preparation, but n inputs are predicted at each time step instead of just at the end

Seq2seq input/target examples

| Option | Input | Target |
|--------|-------|--------|
| 1 | [0, 1, 2] | [1, 2, 3] |
| 2 | [0, 1, 2] | [[1, 2], [2, 3], [3, 4]] |
| 3 | [0, 1, 2] | [[1, 2, 3], [2, 3, 4], [3, 4, 5]] |

Problems with long sequences

  • Gradient vanishing/exploding
    • Choose activation functions and initialization carefully
    • Consider “Layer normalization” (across features)
  • “Forgetting” early data
    • Skip connections through time
    • “Leaky” RNNs
    • Long short-term memory (LSTM)
  • Computational efficiency and memory constraints
    • Gated recurrent units (GRUs)

Skip connections and leaky RNNs

  • Simple way of preserving earlier data:
  • Vanilla RNN: depends on only
  • Skip connection: depends on , , , etc.
  • Leaky RNN has a smooth “self-connection” to dampen the exponential: $h_t = \alpha h_{t-1} + (1 - \alpha)\, \phi(W_x x_t + W_h h_{t-1} + b)$, with $0 < \alpha < 1$
  • Not common approaches anymore, as LSTM, GRU, and especially attention mechanisms are more popular

Long Short-Term Memory (LSTM)


Figure from Scikit-learn textbook

Gated Recurrent Units (GRUs)


Figure from Scikit-learn textbook

Next up: Natural Language Processing


Preview: Natural Language Processing

  • Natural Language Processing (NLP) is a field of study that focuses on the interaction between computers and human language.
  • RNNs are widely used in NLP tasks such as language modeling, machine translation, sentiment analysis, and text generation.
  • Language modeling involves predicting the next word in a sequence of words, which can be done using RNNs.
  • Machine translation uses RNNs to translate text from one language to another.
  • Sentiment analysis aims to determine the sentiment or emotion expressed in a piece of text, and RNNs can be used for this task.
  • Text generation involves generating new text based on a given input, and RNNs are commonly used for this purpose.
This slide written by GitHub Copilot in Winter 2024

Preview: Natural Language Processing

  • What is Natural Language Processing (NLP)?
  • Common NLP tasks:
    • Language modeling
    • Machine translation
    • Sentiment analysis
    • Text generation
  • How RNNs are applied in NLP
This slide written by GitHub Copilot in Winter 2025

Preview: Natural Language Processing

  • NLP in 2026 is dominated by large language models (LLMs) like GPT-4o, Claude, and Gemini
  • Transformer-based architectures have largely replaced RNNs for most NLP tasks
  • Key capabilities of modern NLP systems:
    • Multi-modal understanding (text, images, audio, video)
    • Long-context reasoning (millions of tokens)
    • Agentic behaviour: tool use, planning, and self-correction
  • ❓ If transformers have replaced RNNs, why are we still studying them?
This slide written by GitHub Copilot in Winter 2026

Lecture 9: Natural Language Processing

HTML Slides | PDF Slides

Natural Language Processing

COMP 4630 | Winter 2026 Charlotte Curtis


Overview


Tokenization

  • Consider the sentence:

    “The cat sat on the mat.”

  • This can be split up into individual words or tokens:

    [“The”, “cat”, “sat”, “on”, “the”, “mat”, “.”]

  • ❓ what other ways could we tokenize this sentence?

  • ❓ what about punctuation, capitalization, etc.?


RNN + tokens: predict the next character

  • Just like predicting the next day’s weather or stock price, we can predict the next character in a sentence using an RNN
  • Input tokens: ['T', 'h', 'e', ' ', 'c', 'a', 't', ' ', 's', 'a', 't', ' ', 'o', 'n', ' ', 't', 'h', 'e', ' ', 'm', 'a', 't', '.']
  • Numeric representation: [20, 8, 5, 0, 3, 1, 20, 0, 19, 1, 20, 0, 15, 14, 0, 20, 8, 5, 0, 13, 1, 20, 2]
  • We could train an RNN model to predict the next character
  • For more info check out Andrej Karpathy’s blog post, one of the sources for the Scikit-learn chapter

Repeatedly predicting the next character

  • To predict whole sentences from a starting point, we can predict the next character and append it to the input, then predict again
  • In practice this tends to get stuck in loops: Input: "to be or not" Output: "to be or not to be or not to be or not..."
  • ❓ how might we avoid this?
  • ❓ could we just predict the next whole word instead?

Controlled chaos: softmax temperature

  • The softmax is defined as: $\sigma(z)_i = \frac{e^{z_i}}{\sum_{k=1}^{K} e^{z_k}}$, where $z$ is a vector of logits, or log probabilities
  • This estimates the probability of class $i$ out of $K$ classes
  • Adding a temperature parameter $T$: $\sigma(z, T)_i = \frac{e^{z_i / T}}{\sum_{k=1}^{K} e^{z_k / T}}$

Temperature Example

  • Vocab = ["to", "be", "or"]
  • Assume that the “logits”
  • Sample the next word from the resulting distribution
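A sketch of temperature-scaled softmax over this vocabulary; since the slide's logit values aren't shown, the numbers here are made up for illustration:

```python
import numpy as np

def softmax_temperature(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

vocab = ["to", "be", "or"]
logits = [2.0, 1.0, 0.5]       # illustrative values, not from the slides

for T in (0.5, 1.0, 2.0):
    # low T sharpens the distribution, high T flattens it toward uniform
    print(T, softmax_temperature(logits, T).round(3))
```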



In the beginning, there were $n$-grams

  • A simple way to represent text is as a bag of $n$-grams

  • unigram: single words (aka “Bag of Words”):

    [“the”, “cat”, “sat”, “on”, “the”, “mat”]

  • bigram: pairs of words:

    [“the cat”, “cat sat”, “sat on”, “on the”, “the mat”]

  • trigram: triples of words:

    [“the cat sat”, “cat sat on”, “sat on the”, “on the mat”]


Predictive text with $n$-grams

  • Given a sequence of tokens, we can predict the probability of the $n$th token given the previous tokens: $P(w_n \mid w_{n-1}, \dots, w_1)$

  • Each of these conditional probabilities can be estimated from the frequency of the $n$-grams in a corpus

  • The most likely next word is the one with the highest probability

  • ❓ What are some limitations of this approach?
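A toy bigram predictor makes the frequency-counting idea concrete: count which word follows each word in a (tiny, made-up) corpus, then predict the most frequent follower:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat sat on the rug".split()

# Count bigram frequencies: P(next | word) ~ count(word, next) / count(word)
following = defaultdict(Counter)
for w, nxt in zip(corpus, corpus[1:]):
    following[w][nxt] += 1

def predict_next(word):
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat": it follows "the" twice, "mat"/"rug" once each
```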


$n$-gram challenges

  • $n$-grams lose the meaning of words:
    • “The cat sat on the mat”
    • “The dog sat on the rug”
  • Also subject to the curse of dimensionality
    • A vocabulary of size $V$ leads to $V^n$ possible $n$-grams
    • Most $n$-grams will not be present in the corpus!
  • ❓ Can you think of an $n$-gram modification that could help this problem?

Side note: The curse of dimensionality

  • Data is often represented as $m$ samples with $n$ features each
  • As $n$ increases, the number of samples required to cover the space increases exponentially
Figure 5.9 from the Deep Learning Book

Word embeddings

  • Alternative solution: represent individual words as vectors, or embeddings

  • ❓ How are these embeddings defined?

| Word | Embedding |
|------|-----------|
| cat | [0.2, 0.3, 0.5] |
| dog | [0.1, 0.4, 0.4] |
| mat | [0.5, 0.2, 0.2] |
| rug | [0.4, 0.1, 0.1] |
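Cosine similarity between these toy embeddings illustrates the key idea: related words end up closer together in the vector space:

```python
import numpy as np

# The toy embeddings from the table above
emb = {
    "cat": np.array([0.2, 0.3, 0.5]),
    "dog": np.array([0.1, 0.4, 0.4]),
    "mat": np.array([0.5, 0.2, 0.2]),
    "rug": np.array([0.4, 0.1, 0.1]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(emb["cat"], emb["dog"]))  # higher: both are animals
print(cosine(emb["cat"], emb["mat"]))  # lower: unrelated concepts
```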


Learning word embeddings

  • Wednesday we’ll discuss the influential Word2Vec paper, but it wasn’t the first time embeddings were learned as part of a network
  • The concept was first presented successfully by Bengio in 2001

So we have embeddings, now what?

  • We can use these embeddings as input to a neural network
  • Applications:
    • Sentiment analysis: is a review/tweet/comment positive or negative?
    • Named entity recognition: who/what is mentioned in a text?
    • Machine translation: convert text from one language to another
    • Predictive text: what word comes next?
    • Text generation: create new text based on a given input

Sentiment analysis

General process:

  • Standardize and tokenize the text
  • Add an embedding layer (trainable or pre-trained)
  • Add a recurrent layer, such as a GRU
  • Add a dense layer with sigmoid activation

To Colab!

This is the process you’ll be following for Assignment 3


Sequence to Sequence models


Back to RNNs

  • RNNs predict the future based on the past

  • This is exactly what we want for predicting stock prices, weather, etc

  • ❓ What about translating a sentence from one language to another?

    Time flies like an arrow; fruit flies like a banana.

  • ❓ Can you think of a way to get RNNs to see the future?


Bidirectional RNNs

  • Simple approach: just reverse the sequence


Figure 10.11 from the Deep Learning Book

Pretraining

  • Embeddings like Word2Vec have been trained on large corpora
  • Surely this provides a great starting point for our models!
    • ❓ what are some potential drawbacks?
  • ELMo was introduced in 2018 specifically to address the limitations of Word2Vec and GloVe (another popular embedding)

“Our representations differ from traditional word type embeddings in that each token is assigned a representation that is a function of the entire input sentence. We use vectors derived from a bidirectional LSTM that is trained with a coupled language model objective on a large text corpus” – Peters et al


Subword Tokenization

  • Word embeddings are great, but still have limitations
  • ELMo uses character tokenization to handle out-of-vocabulary words
  • In between characters and words are subwords
    • "This warm weather is enjoyable"
    • "This", "warm", "weath", "er", "is", "enjoy", "able"
  • Byte Pair Encoding is the most common subword tokenization method, used by GPT; BERT uses the closely related WordPiece
  • ❓ What are some advantages of subword tokenization?

Machine Translation

| English | Spanish |
|---------|---------|
| My mother did nothing but weep | Mi madre no hizo nada sino llorar |
| Croatia is in the southeastern part of Europe | Croacia está en el sudeste de Europa |
| I would prefer an honorable death | Preferiría una muerte honorable |
| I have never eaten a mango before | Nunca he comido un mango |
  • ❓ What kind of challenges can you think of?

Encoder-Decoder Models

  • RNNs can convert an arbitrary length sequence into a fixed length vector
  • RNNs can convert a fixed length vector into an arbitrary length sequence
  • Why not use two RNNs to convert a sequence to a sequence?
  • The output head is a softmax layer with one node for each word in the target vocabulary



Teacher Forcing

  • This model uses teacher forcing to train the decoder
  • It feels like cheating, but this involves feeding the correct output to the decoder at each time step
  • This speeds up training and can improve performance
  • Avoids the whole backpropagation through time thing and makes training of RNNs parallelizable
  • ❓ What are the implications at inference time?

Coming up next: Attention mechanisms and Transformers

Lecture 10: Transformers

HTML Slides | PDF Slides

Transformers and Large Language Models

COMP 4630 | Winter 2026 Charlotte Curtis


Overview

  • Attention mechanisms
  • Transformers and large language models
    • Multi-head attention, positional encoding, and magic
    • BERT, GPT, Llama, etc.
  • References and suggested readings:

Attention Overview


  • Basically a weighted average of the encoder states, called the context
  • Weights usually come from a softmax layer
  • ❓ what does the use of softmax tell us about the weights?

Encoder-Decoder with Attention


  • The alignment model is used to calculate attention weights
  • The context vector changes at each time step!

Attention Example


  • Attention weights can be visualized as a heatmap
  • ❓ What can you infer from this heatmap?

The math version

The context vector at each step is computed from the alignment weights as:

$$c_t = \sum_i \alpha_{t,i} h_i$$

where $h_i$ is the encoder output at step $i$ and $\alpha_{t,i}$ is computed as:

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}$$

where $e_{t,i}$ is the alignment score or energy between the decoder hidden state at step $t$ and the encoder output at step $i$.


Different kinds of attention

  • The original Bahdanau attention (2014) model: $e_{t,i} = v^\top \tanh(W s_{t-1} + U h_i)$, where $v$, $W$, and $U$ are learned parameters
  • Luong attention (2015) uses a dot product, where the encoder outputs are multiplied by the decoder hidden state at the current step: $e_{t,i} = s_t^\top h_i$
  • ❓ What might be a benefit of dot-product attention?

Attention is all you need


  • Some Googlers wanted to ditch RNNs
  • If they can eliminate the sequential nature of the model, they can parallelize training on their massive GPU (TPU) clusters
  • Problem: context matters!
  • ❓ How can we preserve order, but also parallelize training?

Transformers


  • Still encoder-decoder and sequence-to-sequence
  • $N$ stacked encoder/decoder layers
  • New stuff:
    • Multi-head attention
    • Positional encoding
    • Skip (residual) connections and layer normalization

Multi-head attention


  • Each input is linearly projected $h$ times with different learned projections
  • The projections are aligned with independent attention mechanisms
  • Outputs are concatenated and linearly projected back to the original dimension
  • Concept: each layer can learn different relationships between tokens

What are V, K, and Q?

  • Attention is described as querying a set of key-value pairs
  • Kind of like a fuzzy dictionary lookup
  • $K$ is an $n_k \times d_k$ matrix of keys, where $d_k$ is the dimension of the keys
  • $V$ is an $n_k \times d_v$ matrix of values
  • $Q$ is an $n_q \times d_k$ matrix of queries
  • The product $Q K^\top$ is $n_q \times n_k$, representing the alignment score between queries and keys (dot product attention)
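Scaled dot-product attention, $\text{softmax}(Q K^\top / \sqrt{d_k})\, V$, is short enough to sketch in NumPy; the matrix sizes here are arbitrary:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) alignment scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V, weights                      # weighted average of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries, d_k = 4
K = rng.normal(size=(3, 4))   # 3 keys
V = rng.normal(size=(3, 5))   # 3 values, d_v = 5

out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)  # (2, 5) (2, 3)
```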

The various (multi) attention heads

  • Encoder self-attention (query, key, value are all from the same sequence)
    • Learns relationships between input tokens (e.g. English words)
  • Decoder masked self-attention:
    • Like the encoder, but only looks at previously generated tokens
    • Prevents the decoder from “cheating” by looking at future tokens
  • Decoder cross-attention:
    • Decodes the output sequence by “attending” to the input sequence
    • Queries come from the previous decoder layer, keys and values come from the encoder (like prior work)

Positional encoding

  • By getting rid of RNNs, we’re down to a bag of words
  • Positional encoding re-introduces the concept of order
  • Simple approach for word position $p$ and dimension $i$: $PE_{(p,\,2i)} = \sin\!\left(p / 10000^{2i/d}\right)$, $PE_{(p,\,2i+1)} = \cos\!\left(p / 10000^{2i/d}\right)$
  • The resulting vector is added to the input embeddings
  • ❓ Why sinusoids? What other approaches might work?
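The sinusoidal encoding can be sketched directly from the formula, using the interleaved sin/cos layout (the sequence length and model dimension below are arbitrary):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]        # word position p
    i = np.arange(d_model // 2)[None, :]     # dimension pair index i
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions get sin
    pe[:, 1::2] = np.cos(angles)             # odd dimensions get cos
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)  # (50, 16): one d_model-dimensional vector per position
```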

Other positional encodings

  • Appendix D of the new version of Hands-on Machine Learning goes into detail about positional encodings, particularly for very long sequences
  • Sinusoidal encoding is no longer used in favour of relative approaches
    • Introduces locality bias
    • Removes early-token bias
  • Example: a learnable bias for each relative position $j - i$, clamped to a maximum distance $k$
  • Only $2k + 1$ bias terms to learn, regardless of sequence length

Interpretability


  • The arXiv version of the paper has some fun visualizations
  • This is Figure 5, showing learned attention from two different heads of the encoder self-attention layer

A smattering of Large language models

  • GPT (2018): Train to predict the next word on a lot of text, then fine-tune
    • Used only masked self-attention layers (the decoder part)
  • BERT (2018): Bidirectional Encoder Representations from Transformers
    • Train to predict missing words in a sentence, then fine-tune
    • Used unmasked self-attention layers only (the encoder part)
  • GPT-2 (2019): Bigger, better, and capable even without fine-tuning
  • Llama (2023): Accessible and open source language model
  • DeepSeek (2024): Significantly more efficient, but accused of being a distillation of OpenAI’s models

Hugging Face

  • Hugging Face provides a whole ecosystem for working with transformers

  • Easiest way is with the pipeline interface for inference:

    from transformers import pipeline
    nlp = pipeline("sentiment-analysis") # for example
    nlp("Cats are fickle creatures")
    
  • Hugging Face models can also be fine-tuned on your own data


And now for something completely different: Deep reinforcement learning

Lecture 11: Reinforcement Learning

HTML Slides | PDF Slides

(Deep) Reinforcement Learning

COMP 4630 | Winter 2026 Charlotte Curtis


Overview

  • Terminology and fundamentals
  • Q-learning
  • Deep Q Networks
  • References and suggested reading:

Reinforcement Learning + LLMs



Terminology

  • Agent: the learner or decision maker
  • Environment: the world the agent interacts with
  • State: the current situation
  • Reward: feedback from the environment
  • Action: what the agent can do
  • Policy: the strategy the agent uses to make decisions

Classic example: Cartpole


The Credit Assignment Problem

  • Problem: If we’ve taken 100 actions and received a reward, which ones were “good” actions contributing to the reward?
  • Solution: Evaluate an action based on the sum of all future rewards
    • Apply a discount factor $\gamma$ to future rewards, reducing their influence
    • Common choice in the range of roughly 0.9 to 0.99
    • Example of actions/rewards (with $\gamma = 0.8$, the return of the first action works out to $10 + 0.8 \times 0 + 0.8^2 \times (-50) = -22$):
      • Action: Right, Reward: 10
      • Action: Right, Reward: 0
      • Action: Right, Reward: -50

Policy Gradient Approach

  • If we can calculate the gradient of the expected reward with respect to the policy parameters, we can use gradient descent to find the best policy
  • Loss function example: REINFORCE (Williams, 1992): $\nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_t R_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$, where:
    • $\pi_\theta(a_t \mid s_t)$ = output probability for action $a_t$ given state $s_t$
    • $R_t$ = (discounted) reward at timestep $t$
    • $\theta$ = model parameters

Value-based methods

  • Policy gradient approach is direct, but only really works for simple policies
  • Value-based methods instead explore the space and learn the value associated with a given action
  • Based on Markov Decision Processes (review from AI?)

Markov Chains

  • A Markov Chain is a model of random states where the future state depends only on the current state (a memoryless process)
  • Used to model real-world processes, e.g. Google’s PageRank algorithm
  • ❓ Which of these is the terminal state?
Figure 18-7 from the Scikit-learn book

Markov Decision Processes


  • Like a Markov Chain, but with actions and rewards
  • Bellman optimality equation: $V^*(s) = \max_a \sum_{s'} T(s, a, s')\left[R(s, a, s') + \gamma V^*(s')\right]$

Iterative solution to Bellman’s equation

Value Iteration:

  1. Initialize $V(s) = 0$ for all states $s$
  2. Update $V(s)$ using the Bellman equation
  3. Repeat until convergence

Problem: we still don’t know the optimal policy
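Value iteration on a toy MDP; the states, transitions, rewards, and $\gamma$ here are all made up for illustration (transitions are deterministic to keep the update simple):

```python
# Tiny deterministic MDP (illustrative, not from the slides):
# state 0: action "a" -> state 1, reward 1; action "b" -> state 1, reward 2
# state 1: action "a" -> state 1, reward 0 (absorbing)
transitions = {0: {"a": (1, 1.0), "b": (1, 2.0)}, 1: {"a": (1, 0.0)}}
gamma = 0.5

V = {0: 0.0, 1: 0.0}
for _ in range(100):  # repeat the Bellman update until (well past) convergence
    V = {s: max(r + gamma * V[s2] for (s2, r) in acts.values())
         for s, acts in transitions.items()}

print(V)  # V[1] = 0 (absorbing, no reward), so V[0] = max(1, 2) + 0.5 * 0 = 2.0
```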


Q-Values

Bellman’s equation for Q-values (optimal state-action pairs):

$$Q^*(s, a) = \sum_{s'} T(s, a, s')\left[R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\right]$$

Optimal policy $\pi^*$:

$$\pi^*(s) = \operatorname{argmax}_a Q^*(s, a)$$

For small spaces, we can use dynamic programming to iteratively solve for $Q^*$


Q-Learning

  • Q-Learning is a variation on Q-value iteration that learns the transition probabilities and rewards from experience
  • An agent interacts with the environment and keeps track of the estimated Q-values for each state-action pair
  • It’s also a type of temporal difference learning (TD learning), which is kind of similar to stochastic gradient descent
  • Interestingly, Q-learning is “off-policy” because it learns the optimal policy while following a different one (in this case, totally random exploration)

Q-Learning Update rule

  • At each iteration, the Q estimate is updated according to:

    $$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left(r + \gamma \max_{a'} Q(s', a')\right)$$

  • Where:

    • $Q(s, a)$ is the estimated value of taking action $a$ in state $s$
    • $\alpha$ is the learning rate (decreasing over time)
    • $r$ is the immediate reward
    • $\gamma$ is the discount factor
    • $\max_{a'} Q(s', a')$ is the maximum Q-value for the next state
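One application of the update rule, sketched with a small Q table (all values illustrative):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One temporal-difference update of the Q table."""
    target = r + gamma * Q[s_next].max()          # r + gamma * max_a' Q(s', a')
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q

Q = np.zeros((3, 2))   # 3 states, 2 actions
Q[1] = [0.0, 1.0]      # suppose we already value state 1 a little

Q = q_update(Q, s=0, a=0, r=5.0, s_next=1)
print(Q[0, 0])  # 0.9 * 0 + 0.1 * (5 + 0.95 * 1) ≈ 0.595
```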

Exploration policies

  • ❓ How do you balance short-term rewards, long-term rewards, and exploration?
  • Our small example used a purely random policy
  • $\epsilon$-greedy chooses to explore randomly with probability $\epsilon$, and greedily with probability $1 - \epsilon$
  • Common to start with a high $\epsilon$ and gradually reduce it (e.g. 1 down to 0.05)

Challenges with Q-Learning

  • ❓ We just converged on a 3-state problem in 10k iterations. How many states are in something like an Atari game?
  • ❓ How do we handle continuous state spaces?

One approach: Approximate Q-learning:

  • A parameterized function $Q_\theta(s, a)$ approximates the Q-value for any state-action pair
  • The number of parameters can be kept manageable
  • Turns out that neural networks are great for this!

Deep Q-Networks

  • We know states, actions, and observed rewards
  • We need to estimate the Q-values for each state-action pair
  • Target Q-values: $y = r + \gamma \max_{a'} Q_\theta(s', a')$
    • $r$ is the observed reward, $s'$ is the next state
    • $\gamma \max_{a'} Q_\theta(s', a')$ is the network’s estimate of the future reward
  • Loss function: $\mathcal{L} = \left(y - Q_\theta(s, a)\right)^2$
  • Standard MSE, backpropagation, etc.

Challenges with DQNs

  • Catastrophic forgetting: just when it seems to converge, the network forgets what it learned about old states and comes crashing down
  • The learning environment keeps changing, which isn’t great for gradient descent
  • The loss value isn’t a good indicator of performance, particularly since we’re estimating both the target and the Q-values
  • Ultimately, reinforcement learning is inherently unstable!

The last topic: Generative AI and ethics



Generated using MidJourney in Winter 2024

GenAI + Ethics Discussion

  • Generative images have gotten really good
  • What can we do? What should we do?

Lecture 12: Ethics Discussion

HTML Slides | PDF Slides

Generative AI, ethics, and policies

COMP 4630 | Winter 2026 Charlotte Curtis


Guiding questions for discussion

  • What are some ways that AI can be helpful? Harmful?
  • What are some ways that AI is used?
    • Axes: intent (malicious <-> innocent), impact (harmful <-> helpful)
  • What concerns do you have about AI in general?
    • What about in education, specifically at MRU?
  • What would you like to see in an AI policy at MRU?

Case study: vibe coding


Costs* of:

*Carbon footprint, $$, scalability, security, etc



April fools?


Impact


Coming up next

  • Easter break, no class on Monday
  • Wednesday: One last journal club presentation (YOLO) and a smattering of additional topics (autoencoders, object detection, image generation?)
  • Monday: last day of class! Assignment 3 results, project checkpoint discussions. Maybe some kind of treats.

Tutorial 1: Data exploration and wrangling

Before building a machine learning model, it is important to understand and wrangle your data into an appropriate numeric format. In this tutorial, we’ll look at how I like to set up my projects and some tips for exploratory visualizations.

Part 1: Project configuration

Tutorials are not marked in this course, so it’s up to you to keep track of them separately. I recommend copying this directory to a new location rather than forking the entire w26 repo and working out of that, otherwise you’ll have a bunch of merge conflicts and extra stuff if you want to submit a PR to the main repo. Alternatively, you can create a fork and work within a separate branch.

Tools:

Part 2: Exploratory visualizations

Visualizations for the purposes of exploring data (rather than communicating results) can be “quick and dirty”, but there are some guidelines to consider, as well as a few tricks that can help.

Follow along with the notebook and answer the various TODOs.

Part 3: Reverse engineer a cleaned dataset

  1. Create a new .ipynb file to explore this new dataset
  2. Read the raw data into a pandas DataFrame. You can either download the zip file, or install the ucimlrepo package and fetch the data directly.
  3. Read the pre-processed version into a different pandas DataFrame.
  4. Try to answer the following questions:
    1. How were the categorical features handled?
    2. Were any of the numerical categories manipulated?
    3. What additional transformations might be useful for this dataset?

Tutorial 2: Linear algebra and NumPy exercises

Linear algebra is foundational to machine learning, and NumPy is a mature Python library that allows you to work efficiently with matrices and vectors.

For this tutorial, I’ll be providing a printed worksheet with some exercises to do by hand. You can then check your answers using NumPy. This serves to both refresh your memory on linear algebra as well as gain some familiarity working with NumPy.

NumPy Basics

By convention, NumPy is imported with the name np:

import numpy as np

You can then create an N-dimensional array by passing a standard Python list to np.array:

v = np.array([1, 2]) # a 1-D vector
print(v.shape) # prints (2,)
A = np.array([[1, 2], [3, 4]]) # a 2-D matrix
print(A.shape) # prints (2, 2)

The default multiplication operator is element-wise. If you want matrix multiplication, use the @ operator, or the dot function for vectors:

print(A * v)
print(A @ v)
print(v.dot(v)) # or np.dot(v, v)

Output:

[[1 4]
 [3 8]]

[ 5 11]

5

Transposing a matrix is quite simple, but the inverse needs the linalg submodule:

print(A.T) # Transpose
print(np.linalg.inv(A)) # Matrix inverse

Output:

[[1 3]
 [2 4]]

[[-2.   1. ]
 [ 1.5 -0.5]]

Similarly, linalg has useful functions like norm, det, solve… getting familiar with the docs can be handy!
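For example, solve computes the solution of a linear system directly, which is generally preferable to forming the inverse; a quick sketch using the same A as above:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
b = np.array([5, 11])

x = np.linalg.solve(A, b)   # solves Ax = b without explicitly inverting A
print(x)                    # [1. 2.]
print(np.linalg.norm(x))    # Euclidean (L2) norm of x
print(np.linalg.det(A))     # determinant of A
```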

More resources

Gradient Descent for Polynomial Regression

Note

Solution now available! You can view it rendered on GitHub here.

There’s some fake data in the file data.csv, with a single feature x and a true value y. Your task is to:

  1. Load the data and look at it
  2. Split it into training, validation, and test sets
  3. Create your design matrix
  4. Implement gradient descent to find the best fit polynomial
  5. Evaluate your model’s performance and experiment with different hyperparameters

It’s up to you to decide what degree polynomial to fit the data, and you can also play around with stochastic gradient descent, mini-batch, hyperparameters, etc.

Important

Do this without the use of scikit learn or other libraries aside from numpy and matplotlib!

Step 0: Import libraries and seed your random number generator

It’s usually a good idea to start with a consistent random number seed to ensure reproducibility.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1234)  # seed with an integer of your choice

Step 1: Load the data and look at it

x, y = np.loadtxt("data.csv", delimiter=",", skiprows=1, unpack=True)
#TODO: visualize

Step 2: Split the data

Weird NumPy quirk: by default, a 1-D array has a shape of (n,), but to behave as a proper column vector in matrix operations, we need it to be (n, 1). An easy way to do this is to pass np.newaxis as the second index when sampling your y data, e.g.:

n = len(y)
train_ids = rng.choice(n, size=int(0.8 * n), replace=False)  # e.g. an 80% training split
x_train, y_train = x[train_ids], y[train_ids, np.newaxis]

Don’t worry about the x values for now, as we’ll be matrixifying them shortly anyway.

Step 3: Create your design matrix $\mathbf{X}$

For the example given in class, the design matrix was simply a column of 1s concatenated with the feature vector, i.e.:

$$\mathbf{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$$

For this exercise, you probably want to fit a higher degree polynomial, so the design matrix will be something like:

$$\mathbf{X} = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^d \\ 1 & x_2 & x_2^2 & \cdots & x_2^d \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^d \end{bmatrix}$$

where $d$ is the degree of the polynomial you want to fit. Try multiple degrees and see what gives the best results.

A note on scaling: the range of x values in this example is fairly small, but if you choose a high degree polynomial you will still end up with fairly different scales for your “features”. Consider normalizing each column of the design matrix (other than the first column accounting for the bias term), remembering to calculate your scaling parameters on the training data and apply them to the validation/test data.

Since you’ll be doing this twice (train/test), you might want to define a function to create the design matrix given a vector x and a degree d.
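Such a function could be as simple as a wrapper around NumPy's Vandermonde matrix builder (a sketch; the function name is just a suggestion):

```python
import numpy as np

def design_matrix(x, degree):
    # Columns are x^0 (the bias column of 1s), x^1, ..., x^degree
    return np.vander(x, degree + 1, increasing=True)

X = design_matrix(np.array([1.0, 2.0, 3.0]), degree=2)
print(X.shape)   # (3, 3)
```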

Step 4: Implement gradient descent

This has a number of sub-components. First you’ll need to define your gradient function. For mean squared error, the gradient can be calculated as:

$$\nabla_\theta \text{MSE} = \frac{2}{n} \mathbf{X}^T (\mathbf{X}\theta - \mathbf{y})$$

where $\mathbf{X}$ is your design matrix, $\theta$ is the current parameter vector, and $\mathbf{y}$ is the true target value.

It’ll also be useful to define the actual mean squared error to evaluate your model:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$
Now you can define your hyperparameters and run your gradient descent. For batch gradient descent, you’ll need to define:

  • learning rate $\eta$ (usually somewhere in the range of $10^{-3}$ to $10^{-1}$)
  • stopping criterion (can just be a fixed number of iterations)

The general algorithm for gradient descent is:

  1. Start with a random $\theta$
  2. Calculate the gradient $\nabla_\theta \text{MSE}$ for the current $\theta$
  3. Update $\theta$ as $\theta \leftarrow \theta - \eta \nabla_\theta \text{MSE}$
  4. Repeat 2-3 until some stopping criterion is met

You could also try mini-batch or stochastic gradient descent by adding an outer epoch loop if you want to get fancy.
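The algorithm above can be sketched as follows (a minimal batch version; variable names are illustrative, and the fixed iteration count stands in for the stopping criterion):

```python
import numpy as np

def mse_gradient(X, theta, y):
    # Gradient of mean squared error: (2/n) * X^T (X theta - y)
    return 2 / len(y) * X.T @ (X @ theta - y)

def batch_gradient_descent(X, y, eta=0.1, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal((X.shape[1], 1))         # start with a random theta
    for _ in range(n_iters):                             # fixed-iteration stopping criterion
        theta = theta - eta * mse_gradient(X, theta, y)  # gradient step
    return theta
```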

Step 5: Evaluate your model’s performance and experiment

Now that you’ve computed a final estimate of $\theta$, apply it to your test set to see how well your model performs, perhaps by plotting the data as well as the best fit curve. If it doesn’t look good, try changing various hyperparameters, like $\eta$, the number of iterations, and the degree of the polynomial. If you didn’t rescale your design matrix earlier, try it now!

Technically we should have done a 3-way train/validate/test split, but I kept it as just train/test to keep things manageable.

Backpropagation with a toy MLP

Before we move on to a full-featured toolbox, I wanted to provide you with something a bit simpler. I’ve written some (questionable) code in mlp_regressor.py to try to implement a multi-layer perceptron. I’ve also provided a starter notebook after throwing you to the wolves last week - you can download your copy here, or from GitHub.

Step 1: Load and preprocess data

We’ll use a well-known and fairly clean dataset to try to predict wine quality. You’ll still need to encode (or ignore) the one categorical feature color, then split and normalize the inputs. You’ll also need to pip install ucimlrepo to get the data-fetching module.

Step 2: Build and train an MLP

The MLPRegressor class should be able to train a small multi-layer perceptron. You can use it like this:

from mlp_regressor import MLPRegressor
mlp = MLPRegressor(X_train.shape[1])
mlp.add_layer(<number of neurons>, "activation function")
... repeat
print(mlp) # to see a summary of layers

loss = mlp.train(X_train, y_train, step_size, epochs)
plt.plot(loss)

It’s very inefficient, so don’t go too crazy with the number of neurons. After training, you can predict by just running the forward pass:

y_pred = mlp.forward(X_train)

There’s also an example in the main block of mlp_regressor.py.

Step 3: Modify the MLP

Try to read through the forward and backward passes to understand how it works. It’s entirely possible I’ve made a mistake somewhere, so don’t hesitate to ask if something doesn’t make sense.

To understand things in more detail, it can be helpful to try to modify it. Right now, the MLP only does whole-batch gradient descent. Can you modify it so that it does mini-batch or stochastic gradient descent?
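One common structure is an outer epoch loop that shuffles the sample indices, with an inner loop over batches. A minimal index generator along those lines (a sketch only, not the actual MLPRegressor API):

```python
import numpy as np

def minibatch_indices(n_samples, batch_size, rng):
    # Yield index arrays covering the dataset once, in shuffled order (one epoch)
    ids = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield ids[start:start + batch_size]

# Usage sketch:
# rng = np.random.default_rng(0)
# for epoch in range(epochs):
#     for batch in minibatch_indices(len(y), 32, rng):
#         ...one gradient step on X[batch], y[batch]
```

Setting batch_size to 1 recovers stochastic gradient descent.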

Modern neural networks with PyTorch

In this lab, I’m going to be walking through the PyTorch intro code that we started in lecture before the midterm. This is a pretty small model so it should be feasible on lab computers or laptops, but you might want to get used to using Colab as well.

I’ll also introduce you to Assignment 2!

RNN Activity: Predicting the stock market

Note

I do not recommend making any kind of financial decisions based on RNNs.

Setup

To fetch the data, we’ll need the yfinance package. Activate your virtual environment, then run pip install yfinance. The starter notebook shows how to download some data.

Steps

  1. As usual, we need to split the data. For time series data, it is very important that you split chronologically, not randomly. The yfinance data has a DateTimeIndex, which means that we can index it with dates directly, e.g:
     train_end = '2023-12-31'
     train = data['Close'][:train_end].values
    
    This also assumes that we’re interested in just the daily closing price. I’d suggest at least a year’s worth of data for both test and validation, with the rest for training.
  2. Decide on a scaling factor and rescale your data to the 1-ish range. This should not be informed by the test or validation data!
  3. Follow the various TODO items to get the single-prediction RNN working.
  4. Modify both the TimeSeriesDataset and the SimpleRnnModel to predict multiple days in the future instead of just one.

    Tip

    You may need to squeeze and/or expand your data at some point in the process to ensure the data dimensions are correct.
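For reference, windowing a 1-D series into (past window, future horizon) pairs can be done like this (a sketch; the starter TimeSeriesDataset may differ in details):

```python
import numpy as np

def make_windows(series, window, horizon):
    # Each input is `window` consecutive values; each target is the next `horizon` values
    X, y = [], []
    for i in range(len(series) - window - horizon + 1):
        X.append(series[i:i + window])
        y.append(series[i + window:i + window + horizon])
    return np.array(X), np.array(y)
```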

Further Exercises and Questions

  1. What impact does the scaling have on the model? Can you trigger exploding/vanishing gradients?
  2. Can you improve the model? Try playing around with LSTM/GRU instead of RNN, number of hidden units, activation functions, etc
  3. What happens if you pass a longer time window to the model? You will need to create a new torch.FloatTensor from the numpy data to do this instead of using the TimeSeriesDataset class.

Tutorial 10: Applied Transformers

We don’t really have the computational resources to train our own transformer models, but we can play around with pre-trained models.

The HuggingFace Transformers API is an easy way to download a pre-trained model and either use as-is, or fine-tune for your application. I’ll provide a few suggestions, but the official tutorial has a lot more info.

Using pretrained transformer models

If you’re working on Kaggle or Colab, the transformers library may already be installed; otherwise, you’ll need to pip install transformers.

  1. Pick a task and a model from the model list - I recommend sticking to something on the small side for your own sanity. Text generation is fun to play with so I’m going to start there, e.g.:

    from transformers import pipeline
    
    pipe = pipeline("text-generation", model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
    

    Tip

    If you’re on a GPU-enabled platform like Kaggle, you can move the model to GPU as follows:

    from accelerate import Accelerator
    
    device = Accelerator().device
    
    pipe = pipeline("text-generation", model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", device=device)
    
  2. Test it out! The simplest way to interact with a text generation model is to pass it a string:

    pipe("What is the airspeed velocity of an unladen swallow?")
    
  3. Add context. Some of the models (like the Deepseek example) allow additional chat context, e.g.:

    messages = [
        {"role": "user", "content": "What is the airspeed velocity of an unladen swallow?"},
    ]
    pipe(messages)
    

    How does this affect the value that is returned?

  4. You can inspect and modify the configuration of the pipeline with the pipe.generation_config object. Try changing the temperature parameter and see how it impacts the results, e.g.:

    pipe.generation_config.temperature = 1.5
    

    I’d also recommend reducing the max_new_tokens so that it runs faster and doesn’t ramble quite so much!

  5. To make a real interactive chatbot, create a sentinel loop that:

    1. prompts for input
    2. passes the input to the pipeline
    3. prints out the response
    4. concatenates the response with the previous messages
    5. prompts for input again
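A single turn of such a loop might be factored out like this (a sketch; it assumes the pipeline accepts a message list and, for chat-style models, returns the running conversation in "generated_text"):

```python
def chat_turn(pipe, messages, user_input, max_new_tokens=100):
    # Append the user's message, generate a reply, and return the updated history
    messages = messages + [{"role": "user", "content": user_input}]
    out = pipe(messages, max_new_tokens=max_new_tokens)[0]["generated_text"]
    if isinstance(out, list):   # chat pipelines return the full message list
        return out
    return messages + [{"role": "assistant", "content": out}]

# Sentinel loop sketch: an empty line ends the chat
# messages = []
# while (prompt := input("You: ")):
#     messages = chat_turn(pipe, messages, prompt)
#     print(messages[-1]["content"])
```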

There’s lots more to play around with in this library, but that’s probably plenty for the tutorial time period.

Assignment 1: Data discovery and visualization

Due January 30, 2026 at 5 pm. Reasonable requests for extensions will be granted when requested at least 48 hours before the due date.

You may work in groups of 2 or 3. Click here to create your group on GitHub Classroom and clone the starting code.

Overview

Real world data is messy and incomplete in unexpected ways. Often, the information you need is in some kind of text field, or in a totally separate database that needs to be merged in. While I have done some basic filtering of the dataset, the focus of this assignment is on exploring and preparing the data for use with a machine learning model.

After exploring and cleaning your data, you will be training a regressor to predict a numeric value. You may choose any of the regression models from scikit-learn’s supervised learning modules, provided inference (prediction) is fairly fast (i.e. don’t use nearest neighbours). If you want to use something other than scikit-learn, just let me know - it’s probably fine! Again, the main focus of this assignment is the dataset exploration and processing part.

Choosing a model is a whole thing in itself, but don’t stress about it too much. Feel free to compare a couple, or consult this chart for guidance, but it doesn’t cover everything. Notably, decision trees and random forests are missing, despite being pretty common, easily understood, and high-performing models.

I have set aside a subset of the data to test your model. Model performance will be evaluated and compared using the Mean Absolute Error. Note that since these are not practice or training datasets, the results may not be very good! As long as you’re in a reasonable range (less than about 100% relative error), your grade will not be affected by the prediction performance.

YYC Housing Data

Download the csv from Google Drive (it’s too big for GitHub!)

Original Source

I have removed some redundant columns and excluded non-residential properties such as parking spots. I’ve also combined the 2023 and 2024 assessment values and excluded properties where the roll_number disappeared from one year to the next (e.g. in the case of a subdivision).

Unlike the California housing dataset example, this contains property details and assessed values for each individual house in Calgary. Your goal will be to try to predict the change in assessment value from 2023 to 2024 based on the other columns of the dataset. Be careful though - there are some weird city-specific codes in there. For example, the sub_property_use column contains the following codes:

| Code | Description |
|------|-------------|
| RE0100 | Residential Acreage |
| RE0110 | Detached |
| RE0111 | Detached with Backyard Suite |
| RE0120 | Duplex |
| RE0121 | Duplex Building |
| RE0201 | Low Rise Apartment Condo |
| RE0210 | Low Rise Rental Condo |
| RE0301 | High Rise Apartment Condo |
| RE0310 | High Rise Rental Condo |
| RE0401 | Townhouse |
| RE0410 | Townhouse Complex |
| RE0601 | Collective Residence |
| RE0800 | Manufactured Home |

Similarly, the land_use_designation refers to the city’s land use zones, which restrict the type of building that can be constructed on a given property. These zones are about to become much simpler, but for now the column exists in the dataset. It’s up to you to decide how (or if) to use it.

Feel free to get creative! Did odd-numbered houses increase more than even? Does distance from city centre make a difference? Apply your domain knowledge to select and transform your features.

Deliverables

Your assignment should consist of both a .ipynb notebook (committed with cells rendered) for your exploratory analysis and model training, as well as your “production” code providing a predict function.

Exploration and model training notebook

This notebook should follow the general process outlined in class to do the first four steps of the ML Project Checklist, plus training of a simple model. The emphasis is on the data exploration and preparation rather than the model itself. Model shortlisting and fine-tuning is not required (though it is allowed if you’d like).

Specifically, this notebook should include:

  • loading the data
  • setting aside a test set (as appropriate for the problem)
  • exploratory visualizations, with comments about your observations
  • your preprocessing pipeline
  • your model training, either with cross-validation or a set-aside validation dataset
  • saving your pipeline + model for production

Some guidelines are provided in the template notebook - feel free to modify as desired.

If you try something and then ultimately don’t use it, it’s fine to leave it in the notebook. I’d like to see things you thought of and then discarded.

Note

While most of your preprocessing decisions are up to you, please do not drop any samples. If you encounter missing values, you can drop the column or impute values, but the number of predicted values must match the number of input samples.

“Production” code

After deciding on a preprocessing pipeline, training a model, and saving it all to disk, implement the function predict in prod.py. This function should:

  • load the required libraries
  • load your model from disk
  • apply your preprocessing
  • return the predicted values

If you are using Scikit-learn’s Pipeline class to combine preprocessing with your regression model, then this function could be as simple as loading your pipeline and returning pipeline.predict(data).

predict should take as input a pandas DataFrame of the data with the re_assessed_value_2024 column removed, and return a NumPy array of predicted property assessment changes from 2023 to 2024. Make sure not to drop any samples! The length of the output must match the length of the input.

Important

Make sure that I can run your prod.py code! I will be running your code from a master script in a Python 3.12 environment with the packages defined in requirements.txt installed. I will also cd to your repo and import prod. Please ensure:

  • You are loading your model using a relative path
  • If you have additional Python dependencies, add them to requirements.txt
  • If you want to use a different language altogether, that is okay - just make sure that your dependencies are clearly documented and easy to install on Windows 11.

Written response

Answer the questions in README.md. Point form and short responses are fine! If you really hate Markdown, you can add a PDF instead.

Tips

  1. I recommend creating a virtual environment and installing the packages in requirements.txt. This will ensure that your code runs on my system:

    python -m venv venv
    venv\Scripts\activate      # Windows; on Mac/Linux: source venv/bin/activate
    pip install -r requirements.txt
    

    on a Mac, use python3 and pip3 instead.

    If you use any other packages and want to add them to the requirements list, you can update it with:

    pip freeze > requirements.txt
    

    (again with pip3 if you are a Mac user).

  2. The end-to-end ML project from the textbook (presented in a condensed form in class) provides examples of some data transformation and visualization techniques, but these do not cover all scenarios. You may need to do some additional research to find the right technique for your dataset - in this case, make sure to cite your sources with a comment in your code.

  3. I have reserved some data for a friendly competition between groups. You might want to test your predict function with your own subset of data to make sure the loading and processing behave correctly in an isolated environment.

  4. Make sure to remove the target column from the dataframe before processing! I will be calling predict with a dataframe that has re_assessed_value_2024 removed.

Marking Scheme

Each of the following components will be marked on a 4-point scale and weighted.

| Component | Weight |
|-----------|--------|
| Data exploration (visualizations, observations) | 30% |
| Preprocessing decisions | 30% |
| Model inference works and training approach is good | 20% |
| Written responses | 15% |

| Score | Description |
|-------|-------------|
| 4 | Excellent - thoughtful and creative without any errors or omissions |
| 3 | Pretty good, but with minor errors or omissions |
| 2 | Mostly complete, but with major errors or omissions, lacking in detail |
| 1 | A minimal effort was made, incomplete or incorrect |
| 0 | No effort was made, or the submission is plagiarized |

Assignment 2: Playing card classification

Due March 6, 2026 at 5 pm, with the usual weekend flexibility.

You may work in teams of 2 or 3. Click here to create your team on GitHub Classroom and clone the starter code. This can be the same team or different from assignment 1.

Overview

The purpose of this assignment is to apply your theoretical knowledge of neural networks (particularly convolutional neural networks) to a real application. You will build and train a model from “scratch” (using PyTorch or some other framework, not really from scratch), and then see how much you can reduce its size while minimizing performance degradation. In addition to building and tweaking a neural network, this assignment serves as an introduction to:

  • Working with a modern NN framework
  • Preprocessing for image data
  • Evaluating classification models

Dataset

This time, I’m choosing the dataset: Playing Cards. This is quite a clean dataset with reasonable class balance. There are lots of implementations of classifiers using this dataset, and if you do look at someone else’s work for ideas, make sure to cite your sources and understand what you are implementing.

Caution

This dataset is very consistent compared to playing cards found in the wild. I will be evaluating on my own hand-curated dataset - you may want to augment yours with some extra card images as well.

Deliverables

Your assignment should consist of the following:

  1. Your notebook(s) and/or Python scripts where you did your experiments, with the final training run and evaluation rendered
  2. A report describing your experiments and your final model decisions
  3. Your final model classes in a Python module named your_team_name.py, alongside saved weights. Please make sure that your models load and run properly in Colab. If additional packages are required, list them in your report document.

I would recommend working in parallel with your teammate(s) and commit your changes after each experiment. It’s fine if you have multiple working notebooks, just indicate to me which is the final version.

Your training code

Your training code (notebook or Python script) should:

  • Load the training data and do some basic data exploration, like looking at samples, number of classes, class distribution, etc. The code in starter.ipynb provides some ideas for connecting Colab to Google Drive, defining the training dataset, and inspecting a few samples.
  • Do any preprocessing you might want to do (at the very least, you’ll probably want to rescale the images from unsigned ints in the range 0-255 to floats in the range 0 to 1)

    Hint: Pytorch has some preprocessing layers that you can stick on the start of your model much like Scikit-learn’s pipelines. Check out Torchvision Transforms for more ideas.

  • Define and train a model, keeping in mind the following:
    • The input layer must match the number of channels of your input. You do not need to define the batch size.
    • The output layer must have as many neurons as classes you are trying to predict.
    • Everything in between is a design choice that you can tweak!
  • Iterate! I would suggest starting with a simple CNN feeding in to a fully connected output layer. I included my fairly random and not at all optimized model in the starter code.
  • Train two models: one “best performance” version, where you try to get the highest accuracy, and one “size optimized” where you try to maintain reasonably good accuracy with the fewest parameters.
  • Once you’ve trained your models, save the weights using torch.save(model.state_dict(), path) as described here. You may need to share these with me via Google Drive if they end up too big, but if not, you can just commit the binary weights file to your repo.
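To make the input/output constraints concrete, here's the rough shape such a model can take in PyTorch (layer sizes are arbitrary placeholders rather than a recommendation, and 53 classes assumes the standard 52 cards plus joker):

```python
import torch
from torch import nn

class SmallCardClassifier(nn.Module):
    # Sketch only: one small conv stack feeding a fully connected output layer
    def __init__(self, n_classes=53, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # makes the head independent of input resolution
        )
        self.head = nn.Linear(32, n_classes)  # one output neuron per class

    def forward(self, x):
        x = self.features(x)
        return self.head(x.flatten(1))
```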

Your production code

To load the weights for your model and run inference, PyTorch needs to know the class definition. There is a way of saving the whole thing at once, but it’s fragile (much like pickling a Scikit-learn pipeline with custom functions).

After you’re happy with both small and large models, copy the class definitions in a Python module named for your team name (e.g. super_awesome_team.py). Also copy over your transformation pipeline - I will be passing this to ImageFolder when I load the top secret evaluation set. You can either use the same one for small and large models, or two separate ones, or parameterize with image size, etc.

from super_awesome_team import SmallModel, BigModel, img_transform
model = SmallModel()
model.load_state_dict(torch.load(path_to_small_weights, weights_only=True))
model.eval() # disables dropout and batch norm for inference

Tip

Make sure to include any necessary preprocessing like rescaling in your transformation pipeline, but do not include image augmentation steps like RandomHorizontalFlip.

Your report

In a separate document, summarize your experiments, models, observations, reflections, etc. I’ve provided a template with more details in the starter code (report.md), though you aren’t limited to the markdown format.

Marking Scheme

Each of the following components will be marked on a 4-point scale and weighted.

| Component | Weight |
|-----------|--------|
| Report: model development and experimentation | 20% |
| Report: reflections | 20% |
| Report: abstract and appendices | 20% |
| Model: load and run on Colab | 20% |
| Model: performance (highest accuracy model) | 10% |
| Model: performance / parameters ratio (size optimized) | 10% |

| Score | Description |
|-------|-------------|
| 4 | Excellent - thoughtful and creative without any errors or omissions |
| 3 | Pretty good, but with minor errors or omissions |
| 2 | Mostly complete, but with major errors or omissions, lacking in detail |
| 1 | A minimal effort was made, incomplete or incorrect |
| 0 | No effort was made, or the submission is plagiarized |

Assignment 3: Classification of Text Data

Due ~~March 20~~ March 27, 2026

You may work in teams up to 3. Click here to create your team on GitHub Classroom.

Overview

The purpose of this assignment is to further your hands-on experience in neural networks with a new type of data: text!

Dataset and task

Download the pre-split “Reddit jokes” csv here (200 MB). You will still need to do a validation split, but I’ve reserved a final test set as with the previous assignments.

The original dataset is 1 million reddit jokes, but I’ve modified it a bit. I added a threshold to the score value to classify each joke as is_funny = True or is_funny=False, and I’ve also removed some redundant columns. I left the score value in there so you can see just how “funny” people thought a given joke was, but the score cannot be used as a predictor.

Your task is to predict whether or not a joke is funny. You may use any of the information in the dataset except for the score column to predict the is_funny value.

Warning

I have not vetted this dataset for appropriateness. While the “not safe for work” label is False for all entries, there is still likely to be the usual degeneration one encounters on Reddit.

Many of the posts have [removed], [deleted], and NaN for the body text field. This is still information! Aside from replacing NaN with an empty string, you probably want to keep these features.

Deliverables

Your repo should have the following:

  1. Your notebook(s) and/or Python scripts where you did your experiments, with the final training run and evaluation rendered
  2. A report describing your experiments and your final model decisions
  3. An inference script with a classify function that loads your model, performs any preprocessing necessary on the list of titles given as inputs, then returns a list of booleans containing the predicted funniness. If your model file(s) is too big for GitHub, you may share a link to it on Google Drive.

Resources

GPU resources are a challenge, and potentially more important with text data. Here are a few options to consider:

  • Kaggle provides 30 hrs/week of free GPU usage
  • Colab provides a “pay as you go” tier at $14/month. I hate asking students to pay for things, but think of it like the cost of a textbook. I’m testing out the pay as you go option, and it seems to use up credits at roughly $1.50/hr on a T4 GPU.

Your training notebook

I’ve included some sample code in the starter.ipynb to get you started. If you’re not using Colab and/or PyTorch, you might need to massage things a bit more (yes, this ends up being an inordinate amount of time in any data project).

Your inference script

This time, I’m asking that you implement an inference script similar to assignment 1 in prod.py. This should handle your data processing, model class definitions, loading the model weights, etc. You can assume that I will be loading the model weights from the working directory; please use relative paths to your weights file.

Your report

In a separate document, summarize your experiments, models, observations, reflections, etc. I’ve provided a template with more details in the starter code (report.md), though as usual you aren’t limited to the markdown format.

Marking Scheme

I’ll be marking each component on the usual 4-point scale and weighting as follows:

| Component | Weight |
|-----------|--------|
| Model development and experimentation process | 30% |
| Model performance and compatibility | 20% |
| Report: reflections | 30% |
| Report: other stuff | 20% |

As usual, the raw performance doesn’t matter too much provided it behaves “okay”.

| Score | Description |
|-------|-------------|
| 4 | Excellent - thoughtful and creative without any errors or omissions |
| 3 | Pretty good, but with minor errors or omissions |
| 2 | Mostly complete, but with major errors or omissions, lacking in detail |
| 1 | A minimal effort was made, incomplete or incorrect |
| 0 | No effort was made, or the submission is plagiarized |

Journal Club

Your 10% journal club mark consists of two components:

  • Reading each paper and participating in weekly discussions (5%)
  • Reading a single paper more thoroughly and presenting that paper (5%)

The signup sheet is on D2L. You will need to be logged in to your MRU Google account to access it.

Presentation

The week you have signed up to present, read the paper and prepare a (maximum) 10 minute presentation. You may use whatever file format you wish, and feel free to copy and paste relevant diagrams or quotes from the paper (a single citation at the start suffices for this purpose). If you include material from other resources, these must be cited as usual.

When creating your presentation, consider the following guiding questions. You don’t need to use these headings specifically, and not every question applies to every paper, but it’s a place to start.

Paper Summary

  • What are the main contributions of this work?
  • What is the context (timeframe, prior knowledge, etc) of this paper?
  • How did the authors justify their conclusions?

Paper’s influence

With the power of hindsight, we can look back and see what impact each paper had on the field of machine learning. Most of them are widely cited. Consider the following:

  • If applicable, try to find implementations of the algorithms described by the paper. These may be shared by the authors (for more recent papers), or implemented by others (for pre-2000s papers)
  • Are the methods described in the paper still relevant today?
  • What is your impression of the impact that this paper had on the field of machine learning?

Your opinions

  • What did you like about the paper?
  • What was the most confusing or challenging to understand?
  • Any additional thoughts?

Rubric

Each of the three sections will be assessed on a 4-point scale, with an additional score for sharing your thoughts and responding to classmates’ questions during the discussion period.

| Component | Score out of 4 |
|-----------|----------------|
| Paper summary | |
| Paper’s influence | |
| Your opinions | |
| Discussion | |
| **Total** | out of 16 |

| Score | Description |
|-------|-------------|
| 4 | Excellent - thoughtful and creative without any errors or omissions |
| 3 | Pretty good, but with minor errors or omissions |
| 2 | Mostly complete, but with major errors or omissions, lacking in detail |
| 1 | A minimal effort was made, incomplete or incorrect |
| 0 | No effort was made, or the submission is plagiarized |

Final Project: Predict something with some data

| | Deliverable | Due date | Weight (of project) |
|---|------------|----------|---------------------|
| 1 | Proposal | Feb 27 (pitch presentation in lab on March 2) | 15% |
| 2 | Checkpoint 1 | March 27 | 20% |
| 3 | Checkpoint 2 | April 13 | 20% |
| 4 | Final Report | April 24 (presentations on April 22*) | 45% |

*I asked the registrar to schedule a block of time during final exams for you to share the outcomes of your projects, so this shows up on MyMRU as a final exam. You can put finishing touches on your report in the days afterwards.

You may work in teams of 2 or 3.

Overview

The goal of the final project is to showcase the skills you’ve learned in this course by applying them to a new dataset and task. You are free to choose anything you like, but do not pick a “hello-world” style dataset. I’d rather you take a risk and have it not be successful than recreate yet another Titanic survival prediction model.

Overall, throughout this project you will:

  1. Choose a task, such as classification, regression, time-series forecasting, NLP, etc.
  2. Find an appropriate dataset for that task. This is harder than it sounds, and may require some iterating when you start working with the data.
  3. Explore the data and preprocess as necessary to handle missing values, feature scaling, etc.
  4. Implement a model to accomplish your task.
  5. Write a formal report describing your project.
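As a minimal illustration of step 3, here is a stdlib-only toy sketch of two common preprocessing operations, mean imputation and z-score feature scaling. In a real project you would more likely reach for pandas and scikit-learn (e.g. `SimpleImputer` and `StandardScaler`), but the underlying arithmetic is the same:

```python
from statistics import mean, pstdev

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

def standardize(values):
    """Scale values to zero mean and unit variance (z-scores)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

# A toy feature column with one missing value
heights = [150.0, None, 170.0, 160.0]
filled = impute_mean(heights)   # the missing entry becomes the mean, 160.0
scaled = standardize(filled)    # now zero mean and unit variance
```

Note that in practice you should compute imputation and scaling statistics on the training split only, then apply them to the validation/test splits, to avoid leaking information.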

Tip

Often, while writing a report and describing your methodology, you realize there is an error in that methodology and have to go back and fix something. Don't leave the writeup to the very end - it's a good idea to write as you go along.

Finding a Dataset

First, think of your interests and a possible prediction task, then look to see if there's a relevant publicly accessible dataset. As you poke around, you might do it the other way and stumble across an interesting dataset that inspires a task - that's okay too!

Some good places to look for datasets:

  • Hugging Face - A popular data- and model-sharing site, and the successor to the now-defunct “Papers with Code” (RIP)
  • Google Dataset Search - of course Google does datasets as well.
  • Wikipedia - a giant list of datasets for machine learning research.
  • Google BigQuery Datasets - we’ll be using this in assignment 3, so I’ll provide guidance on accessing it
  • Kaggle - a website for machine learning competitions with a huge number of datasets. You can filter by file size, topic and more

Submissions

This time, there is no starter code to share, as each of your projects will be different. Please submit your proposal, checkpoints, and report as PDFs on D2L, and include a link to your code (e.g. on GitHub or Google Drive) in your final report abstract. This code may be public so you can use it as a portfolio piece, or if you prefer to keep it private, you can invite me to view it on GitHub.

4-Point Scale

Each of the components will be marked on various criteria using the usual 4-point scale. More details are provided in each component description.

| Score | Description |
| --- | --- |
| 4 | Excellent - thoughtful and creative without any errors or omissions |
| 3 | Pretty good, but with minor errors or omissions |
| 2 | Mostly complete, but with major errors or omissions, lacking in detail |
| 1 | A minimal effort was made, incomplete or incorrect |
| 0 | No effort was made, or the submission is plagiarized |

Project Proposal

Your project proposal should be fairly concise, 1-2 pages consisting of the following sections.

Abstract

A 1-paragraph summary of what you intend to do and with what data.

Project Plan

Using words, tables, images, etc., describe:

  • The goal of your project
  • The dataset you intend to use
  • The approach you plan to take, referring to at least one paper

Risk Assessment and Backup Plan

Scoping a project is very challenging, and generally we want to do more than can be accomplished in the provided time. In this section, identify:

  • Challenges you anticipate, and potential solutions/mitigations (this may be best presented as a table)
  • A reduced scope goal that will still constitute project success if your primary objective can’t be met

References

Include a link to your proposed dataset and cite at least one technical paper describing the approach you plan to use.

Pitch Presentation

Prepare a very short presentation (1-2 slides, 3-4 minutes) to "pitch" your project during the lab on Monday, March 2. You can highlight whatever part is the most exciting to you; the main goal is to share with the class and learn who is working on similar projects.

Marking Scheme

Using the usual 4-point scale, I will be assessing the following equally-weighted categories:

  • Description of task and dataset
  • Thoughtful risk assessment and contingency plan
  • Appropriate references and tentative approach
  • Pitch presentation

For a total of 16 points.

Checkpoints

At this point, you should have begun some preliminary data exploration and preprocessing. You may have also begun building your model pipeline. Your checkpoint report should again be 1-2 pages with the following sections:

Wins

What has been working so far? What are you excited about? Share your good news! This would be a good place to show some preliminary results or interesting data discoveries.

Challenges

What challenges (foreseen or otherwise) have you encountered? Is there anything in particular that you need help to get past?

Remaining work

What do you have left to do on your project? Is it feasible within the time constraints? If not, how do you plan to revise your project plan?

Project revision (if needed)

As a result of your progress to date, do you anticipate any changes to your project plan? It’s okay if it’s a significant change - perhaps you discovered a better dataset, or came up with a different approach.

If you do have major changes, please cite the relevant sources for your new revised project.

References

This section may not be necessary if you don’t have significant changes.

Marking Scheme

Using the usual 4 point scale, I will be assessing the following categories:

  • Reasonable amount of progress made on project
  • Description of challenges and assessment of progress
  • Remaining work and revision (if needed)
  • Overall coherence and clarity

For a total of 16 points.