Lecture 2: Math Review

COMP 4630 | Winter 2026 Charlotte Curtis

Math review

MATH 1203: Linear algebra
MATH 1200: Differential calculus
MATH 2234: Statistics

Linear algebra

Vectors are multidimensional quantities (unlike scalars):

$\mathbf{v} = \vec{v} = \begin{bmatrix} v_{1} \\ v_{2} \\ \vdots \\ v_{n} \end{bmatrix}$

A common vector space is $\mathbb{R}^{2}$, the 2D Euclidean plane. Example:

$v_{1} = \begin{bmatrix} 3 \\ 4 \end{bmatrix}$

Vector operations

Addition: $v_{1} + v_{2} = \begin{bmatrix} v_{11} + v_{21} \\ v_{12} + v_{22} \end{bmatrix}$
Scalar multiplication: $c v = \begin{bmatrix} c v_{1} \\ c v_{2} \end{bmatrix}$
Dot product: $v_{1} \cdot v_{2} = v_{11} v_{21} + v_{12} v_{22}$ (yields a scalar)
- Can be thought of as the projection of one vector onto another, or how much two vectors are aligned in the same direction
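
The three operations above can be sketched in plain Python, with lists standing in for 2D vectors. The helper names (`add`, `scale`, `dot`) and the second vector `v2` are illustrative choices, not part of the slides.

```python
# Vectors as plain Python lists; no external libraries assumed.

def add(u, v):
    """Element-wise sum of two vectors of equal length."""
    return [a + b for a, b in zip(u, v)]

def scale(c, v):
    """Multiply every component of v by the scalar c."""
    return [c * a for a in v]

def dot(u, v):
    """Sum of component-wise products; the result is a scalar."""
    return sum(a * b for a, b in zip(u, v))

v1 = [3, 4]
v2 = [1, -2]  # illustrative second vector, not from the slides

print(add(v1, v2))   # [4, 2]
print(scale(2, v1))  # [6, 8]
print(dot(v1, v2))   # 3*1 + 4*(-2) = -5
```

A negative dot product, as here, indicates the two vectors point more than 90 degrees apart.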

Vector norms

The norm of a vector is a measure of its length
Most common is the Euclidean norm (or $L^{2}$ norm): $\| v \|_{2} = \| v \| = \sqrt{\sum_{i=1}^{n} v_{i}^{2}}$
You might also see the $L^{1}$ norm, particularly as a regularization term: $\| v \|_{1} = \sum_{i=1}^{n} | v_{i} |$
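
A minimal sketch of both norms, assuming vectors are stored as Python lists. The function names `l2_norm` and `l1_norm` are made up for this example.

```python
import math

def l2_norm(v):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in v))

def l1_norm(v):
    """Sum of absolute values of the components."""
    return sum(abs(x) for x in v)

v = [3, -4]  # illustrative vector
print(l2_norm(v))  # 5.0
print(l1_norm(v))  # 7
```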

Useful vectors

Unit vector: A vector with a norm of 1, e.g. $x = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$, $y = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$
Normalized vector: A vector divided by its norm, e.g. $\hat{v} = \frac{v}{\| v \|}$
Dot product can also be written as $v_{1} \cdot v_{2} = \| v_{1} \| \| v_{2} \| \cos (θ)$, where $θ$ is the angle between the two vectors

Yes, a normalized vector is also a unit vector; the main difference is in context and notation
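
As a rough illustration, normalization and the cosine form of the dot product can be checked numerically. The helpers below (`norm`, `dot`, `normalize`, `angle_between`) are hypothetical names for this sketch.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Divide v by its norm; undefined for the zero vector."""
    n = norm(v)
    if n == 0:
        raise ValueError("the zero vector cannot be normalized")
    return [x / n for x in v]

def angle_between(u, v):
    """Angle in radians, from u . v = |u| |v| cos(theta)."""
    cos_theta = dot(u, v) / (norm(u) * norm(v))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, cos_theta)))

v_hat = normalize([3, 4])
print(v_hat)                          # [0.6, 0.8]
print(angle_between([1, 0], [0, 1]))  # pi/2: the two axes are perpendicular
```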

Matrices

A matrix is a 2D array of numbers:

$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix}$

Notation: Element $a_{ij}$ is in row $i$ , column $j$ , also written as $A_{ij}$ .

Rows then columns! $M \times N$ matrix has $M$ rows and $N$ columns

Matrix operations

Addition: element-wise if dimensions match. $A + B = B + A$
Scalar multiplication: just like vectors
Matrix multiplication: $C = A B$ where the elements of $C$ are: $c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$
- Multiply and sum rows of $A$ with columns of $B$ (the number of columns of $A$ must equal the number of rows of $B$)
- Usually, $A B \neq B A$
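
The element formula for $c_{ij}$ translates almost directly into nested loops. This is a sketch using lists of rows; `matmul` and the two small matrices are illustrative, chosen only to show that the order of multiplication matters.

```python
def matmul(A, B):
    """Return C = A B, where c_ij = sum over k of a_ik * b_kj."""
    n = len(B)  # shared inner dimension
    if any(len(row) != n for row in A):
        raise ValueError("columns of A must equal rows of B")
    return [[sum(A[i][k] * B[k][j] for k in range(n))
             for j in range(len(B[0]))]
            for i in range(len(A))]

A = [[1, 2],
     [3, 4]]
B = [[0, 1],
     [1, 0]]  # swaps columns when applied on the right, rows on the left

print(matmul(A, B))  # [[2, 1], [4, 3]]
print(matmul(B, A))  # [[3, 4], [1, 2]]
```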

Matrix multiplication examples

Matrix times a matrix: $A = \begin{bmatrix} 2 & 0 \\ 1 & 3 \\ -4 & 5 \end{bmatrix}, B = \begin{bmatrix} -1 & 0 & 1 \\ 1 & 3 & 7 \end{bmatrix}$

Matrix times a vector: $A = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}, v = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$

Where we left off on January 14

Matrix transpose

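Transpose: $A^{T}$ swaps rows and columns: $A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, A^{T} = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}$
Inverse: just as $\frac{1}{x} \cdot x = 1$, $A^{- 1} A = I$, where $I$ is the identity matrix: $A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, A^{- 1} = \begin{bmatrix} -2 & 1 \\ 1.5 & -0.5 \end{bmatrix}, A^{- 1} A = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$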

Not every matrix is invertible!
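
A small sketch of the transpose and of the 2x2 inverse, using the standard closed-form 2x2 formula (the slides do not state which method they use). The function names are illustrative. A zero determinant is what makes a 2x2 matrix non-invertible.

```python
def transpose(A):
    """Swap rows and columns."""
    return [list(col) for col in zip(*A)]

def det_2x2(A):
    (a, b), (c, d) = A
    return a * d - b * c

def inverse_2x2(A):
    """Closed-form inverse of a 2x2 matrix; fails if the determinant is 0."""
    (a, b), (c, d) = A
    det = det_2x2(A)
    if det == 0:
        raise ValueError("matrix is singular, so it has no inverse")
    return [[d / det, -b / det],
            [-c / det, a / det]]

A = [[1, 2],
     [3, 4]]
print(transpose(A))    # [[1, 3], [2, 4]]
print(inverse_2x2(A))  # [[-2.0, 1.0], [1.5, -0.5]]

singular = [[1, 2],
            [2, 4]]  # second row is twice the first
print(det_2x2(singular))  # 0
```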

Calculus: Notation

The derivative of a function $y = f (x)$ is represented as:

$f^{'} (x) = \frac{d y}{d x} = \lim_{h \to 0} \frac{f (x + h) - f (x)}{h}$

The second derivative is denoted:

$f^{''} (x) = \frac{d ^{2} y}{d x ^{2}} = \frac{d}{d x} (\frac{d y}{d x})$

and so on.
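
The limit definition of the first derivative suggests a simple numerical approximation: pick a small but nonzero $h$. The sketch below assumes an illustrative function $f (x) = x^{3}$, whose exact derivative is $3 x^{2}$; the name `forward_difference` and the step size are arbitrary choices.

```python
def forward_difference(f, x, h=1e-6):
    """Approximate f'(x) with the difference quotient for a small h."""
    return (f(x + h) - f(x)) / h

def f(x):
    return x ** 3  # illustrative function, exact derivative is 3x^2

approx = forward_difference(f, 2.0)
exact = 3 * 2.0 ** 2

print(approx)  # close to 12, with a small error that shrinks with h
print(exact)   # 12.0
```

Making $h$ too small eventually hurts accuracy because of floating-point rounding, so in practice the step size is a trade-off.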

Differentiability

For a function to be differentiable at a point $x_{A}$ , it must be:

Defined at $x_{A}$
Continuous at $x_{A}$
Smooth at $x_{A}$
Non-vertical at $x_{A}$

Select rules of differentiation

	Function $f$	Lagrange	Leibniz
Constant	$f (x) = c$	$f^{'} (x) = 0$	$\frac{df}{d x} = 0$
Power	$f (x) = x^{r}$ with $r \neq 0$	$f^{'} (x) = r x^{r - 1}$	$\frac{df}{d x} = r x^{r - 1}$
Sum	$f (x) = g (x) + h (x)$	$f^{'} (x) = g^{'} (x) + h^{'} (x)$	$\frac{df}{d x} = \frac{d g}{d x} + \frac{d h}{d x}$
Exponential	$f (x) = e^{x}$	$f^{'} (x) = e^{x}$	$\frac{df}{d x} = e^{x}$
Chain Rule	$f (x) = g (h (x))$	$f^{'} (x) = g^{'} (h (x)) h^{'} (x)$	$\frac{df}{d x} = \frac{d g}{d h} \frac{d h}{d x}$

Chain rule example

Find $\frac{df}{d x}$ for $f (x) = σ (x) = \frac{1}{1 + e ^{- x}}$
Now let $y = σ (x_{1})$, where $x_{1} = w x$. What is $\frac{d y}{d x}$?
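
One way to check an answer to this exercise is to compare it against a numerical derivative. The sketch below assumes the standard result $σ^{'} (x) = σ (x) (1 - σ (x))$ and applies the chain rule with $x_{1} = w x$. The values of `w` and `x` are arbitrary, and all function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dsigmoid(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def dy_dx(x, w):
    """Chain rule for y = sigma(x1), x1 = w*x: dy/dx = dy/dx1 * dx1/dx."""
    return dsigmoid(w * x) * w

def central_difference(f, x, h=1e-5):
    """Numerical derivative used only to check the analytic result."""
    return (f(x + h) - f(x - h)) / (2 * h)

w, x = 2.0, 0.5  # arbitrary test values
analytic = dy_dx(x, w)
numeric = central_difference(lambda t: sigmoid(w * t), x)

print(analytic, numeric)  # the two values should agree to many decimal places
```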

Partial derivatives

For a scalar valued function $y = f (x_{1}, x_{2})$ , there are two partial derivatives:

$\frac{\partial y}{\partial x _{1}}, \frac{\partial y}{\partial x _{2}}$

These are computed by holding the “other” variable(s) constant. For example, if $y = 2 x_{1} + x_{2} + x_{1} x_{2}$ , then:

$\frac{\partial y}{\partial x _{1}} = 2 + x_{2}, \frac{\partial y}{\partial x _{2}} = 1 + x_{1}$
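
The "hold the other variable constant" idea can be mirrored numerically by nudging one argument at a time. This is a sketch only; the helper `partial` and the evaluation point are assumptions made for illustration.

```python
def y(x1, x2):
    return 2 * x1 + x2 + x1 * x2

def partial(f, args, i, h=1e-6):
    """Central-difference estimate of df/d(args[i]), other arguments fixed."""
    up, down = list(args), list(args)
    up[i] += h
    down[i] -= h
    return (f(*up) - f(*down)) / (2 * h)

point = (3.0, 5.0)  # arbitrary evaluation point

dy_dx1 = partial(y, point, 0)  # analytic value: 2 + x2 = 7
dy_dx2 = partial(y, point, 1)  # analytic value: 1 + x1 = 4

print(dy_dx1, dy_dx2)
```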

A brief introduction to vector calculus

Putting together partial derivatives with vectors and matrices we get:

Scalar-valued $f (x)$ :

$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_{1}} \\ \frac{\partial f}{\partial x_{2}} \\ \vdots \\ \frac{\partial f}{\partial x_{n}} \end{bmatrix}$

Vector-valued $f (x)$ :

$J_{f} = \begin{bmatrix} \nabla^{T} f_{1} \\ \nabla^{T} f_{2} \\ \vdots \\ \nabla^{T} f_{m} \end{bmatrix} = \begin{bmatrix} \frac{\partial f_{1}}{\partial x_{1}} & \dots & \frac{\partial f_{1}}{\partial x_{n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_{m}}{\partial x_{1}} & \dots & \frac{\partial f_{m}}{\partial x_{n}} \end{bmatrix}$

Most of the time we’ll just be working with the gradient
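
As a small worked case, take the earlier partial derivative example $y = 2 x_{1} + x_{2} + x_{1} x_{2}$. Stacking its two partial derivatives gives the gradient:

$\nabla y = \begin{bmatrix} \frac{\partial y}{\partial x_{1}} \\ \frac{\partial y}{\partial x_{2}} \end{bmatrix} = \begin{bmatrix} 2 + x_{2} \\ 1 + x_{1} \end{bmatrix}$

At an arbitrarily chosen point such as $(x_{1}, x_{2}) = (3, 5)$ this evaluates to $\begin{bmatrix} 7 \\ 4 \end{bmatrix}$.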

Statistics: Notation

A random variable $x \sim P$ is a variable that can take on random values according to some probability distribution $P$
$x$ may take on discrete (e.g. dice rolls) or continuous (e.g. age) values
$X$ or upright $\mathrm{x}$ for the random variable and italic $x$ or $x_{i}$ for a specific value
$P (x)$ for a discrete distribution and $p (x)$ for continuous
$x_{P} \equiv x \sim P$ and $x_{p} \equiv x \sim p$

Some textbooks/papers/websites use different notation!

Discrete random variables

A discrete probability mass function describes the probability of $x$ taking on a specific value
Example: for a balanced 6-sided die, $P (x = 1) = \frac{1}{6}$
You can add together probabilities, e.g. $P (x \leq 3) = \sum_{i=1}^{3} P (x = i)$
$\sum_{x} P (x) = 1$ and $P (x_{i}) \geq 0$ for any valid distribution
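
The die example can be written out with exact fractions. This is a sketch in which the probability mass function is stored as a dictionary; the variable names are illustrative.

```python
from fractions import Fraction

# Probability mass function of a balanced 6-sided die.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# P(x <= 3) is the sum of the probabilities of faces 1, 2 and 3.
p_at_most_3 = sum(pmf[i] for i in range(1, 4))

# Validity checks: probabilities are non-negative and sum to 1.
is_valid = all(p >= 0 for p in pmf.values()) and sum(pmf.values()) == 1

print(p_at_most_3)  # 1/2
print(is_valid)     # True
```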

Continuous random variables

A continuous probability density function $p (x)$ gives the probability of $x$ falling in some tiny interval $δ x$ as approximately $p (x) δ x$
Example: the uniform distribution, $p (x) = \frac{1}{b - a}$ for $a \leq x \leq b$
$P (x = x_{i}) = 0$ for any specific value $x_{i}$ (a single point has zero probability, even where the density $p (x_{i})$ is nonzero)
Need to integrate to get a concrete probability, e.g. $P (x \leq a) = \int_{- \infty}^{a} p (x) d x$
$\int_{- \infty}^{\infty} p (x) d x = 1$ and $\int_{a}^{b} p (x) d x \geq 0$ for any valid distribution
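
To make the integration step concrete, here is a rough sketch that integrates a uniform density with a midpoint Riemann sum. The bounds 2 and 6, the query point 3, and the helper names are all arbitrary choices for illustration.

```python
def uniform_pdf(x, a, b):
    """Density of the uniform distribution on [a, b]; zero elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def integrate(f, lo, hi, steps=10_000):
    """Midpoint-rule approximation of the integral of f from lo to hi."""
    width = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * width) for k in range(steps)) * width

a, b = 2.0, 6.0

# The density is zero below a, so integrating from a is enough.
total = integrate(lambda x: uniform_pdf(x, a, b), a, b)
p_le_3 = integrate(lambda x: uniform_pdf(x, a, b), a, 3.0)

print(total)   # approximately 1
print(p_le_3)  # approximately 0.25, i.e. (3 - a) / (b - a)
```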

Expectation and variance

The expectation or expected value $E [x]$ of a random variable is its average value
$E [x_{P}] = \sum_{x} x P (x)$ and $E [x_{p}] = \int_{- \infty}^{\infty} x p (x) d x$
More generally, for any function $f (x)$ : $E [f (x)] = \sum_{x} f (x) P (x)$ for discrete $x$ and $E [f (x)] = \int_{- \infty}^{\infty} f (x) p (x) d x$ for continuous $x$
The variance describes how much the values vary from their mean: $Var [x] = E [(x - E [x])^{2}]$
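
For a discrete distribution both quantities are weighted sums, which the following sketch computes exactly for a balanced 6-sided die. The helper `expectation` is a hypothetical name; it takes an optional function `f` so that the same code covers $E [x]$ and $E [f (x)]$.

```python
from fractions import Fraction

pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def expectation(pmf, f=lambda x: x):
    """E[f(x)] = sum over x of f(x) * P(x)."""
    return sum(f(x) * p for x, p in pmf.items())

mean = expectation(pmf)
# Var[x] = E[(x - E[x])^2]
variance = expectation(pmf, lambda x: (x - mean) ** 2)

print(mean)      # 7/2
print(variance)  # 35/12
```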

Multiple random variables

Joint probability $P (x, y)$ is the probability of $x$ and $y$ occurring together
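Conditional probability $P (\mathrm{x} = x \mid \mathrm{y} = y)$ is the probability that the random variable $\mathrm{x}$ takes on value $x$ given that $\mathrm{y} = y$ has already happened (upright letters are the random variables, italic letters their values)
In general, $P (\mathrm{x} = x \mid \mathrm{y} = y) = \frac{P (\mathrm{x} = x, \mathrm{y} = y)}{P (\mathrm{y} = y)}$, provided $P (\mathrm{y} = y) > 0$
For independent variables, $P (\mathrm{x} = x \mid \mathrm{y} = y) = P (\mathrm{x} = x)$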
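
A minimal sketch of the conditional probability formula, using a made-up joint distribution: one roll of a balanced die, where the second variable records whether the face is even or odd. The dictionary layout and function names are assumptions for this example.

```python
from fractions import Fraction

# Joint distribution over (face, parity) for one roll of a balanced die.
joint = {
    (face, "even" if face % 2 == 0 else "odd"): Fraction(1, 6)
    for face in range(1, 7)
}

def marginal_y(joint, y):
    """P(y): sum the joint probability over every x."""
    return sum(p for (_, y_val), p in joint.items() if y_val == y)

def conditional(joint, x, y):
    """P(x | y) = P(x, y) / P(y)."""
    return joint.get((x, y), Fraction(0)) / marginal_y(joint, y)

print(conditional(joint, 2, "even"))  # 1/3
print(conditional(joint, 3, "even"))  # 0
```

Here the two variables are clearly dependent: knowing the parity changes the probability of rolling a 2 from 1/6 to 1/3.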

Covariance

The covariance between $f (x)$ and $g (y)$ gives a sense of how linearly related they are and how much they vary together: $Cov (f (x), g (y)) = E [(f (x) - E [f (x)]) (g (y) - E [g (y)])]$
Related to correlation as $Corr (f (x), g (y)) = \frac{Cov (f (x), g (y))}{\sqrt{Var (f (x)) Var (g (y))}}$
The covariance matrix of a random vector $x$ is a square matrix where the $(i, j)$ element is the covariance between $x_{i}$ and $x_{j}$
The diagonal of the covariance matrix gives $Var (x_{i})$
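
The following sketch computes covariance, correlation, and a 2x2 covariance matrix for a tiny made-up data set, treating the three pairs as equally likely outcomes (so every expectation is a plain average). Correlation is taken as covariance divided by the product of the two standard deviations. All names and data are illustrative.

```python
import math

# Three equally likely (x, y) outcomes with y = 2x, chosen for illustration.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def mean(values):
    return sum(values) / len(values)

def cov(us, vs):
    """E[(u - E[u]) (v - E[v])] with all outcomes equally likely."""
    mu, mv = mean(us), mean(vs)
    return mean([(u - mu) * (v - mv) for u, v in zip(us, vs)])

def corr(us, vs):
    """Covariance scaled by the product of the standard deviations."""
    return cov(us, vs) / math.sqrt(cov(us, us) * cov(vs, vs))

# Covariance matrix of the random vector (x, y); variances on the diagonal.
cov_matrix = [[cov(xs, xs), cov(xs, ys)],
              [cov(ys, xs), cov(ys, ys)]]

print(cov(xs, ys))   # approximately 4/3
print(corr(xs, ys))  # approximately 1, since y is an exact linear function of x
```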

The Normal distribution

$N (x; μ, σ^{2}) = \sqrt{\frac{1}{2 π σ^{2}}} \exp \left( - \frac{1}{2 σ^{2}} (x - μ)^{2} \right)$

Good “default choice” for two reasons:

The central limit theorem shows that the sum of many ( $> 30$ ish) independent random variables is approximately normally distributed
Has the most uncertainty (maximum entropy) of any distribution with the same variance

We can’t easily integrate $N (x)$ , so numerical approximations are used
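
One common numerical route is the error function, which Python's standard library exposes as `math.erf`. The sketch below evaluates the density and the cumulative probability for a Normal distribution. The function names and the default parameters (mean 0, standard deviation 1) are choices made for this example.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density with mean mu and standard deviation sigma."""
    coeff = 1.0 / math.sqrt(2 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x), computed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Probability of landing within one standard deviation of the mean.
within_one_sigma = normal_cdf(1.0) - normal_cdf(-1.0)

print(normal_cdf(0.0))   # 0.5
print(within_one_sigma)  # approximately 0.6827
```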

Coming up next

Training (regression) models
- Linear regression
- Gradient descent
References and suggested reading:
- Scikit-learn book:
  - Chapter 4: Training Models
- Deep Learning Book
  - Section 5.1.4: Linear Regression
