Lecture 5: Classification
Classification loss functions and metrics
COMP 4630 | Winter 2025 Charlotte Curtis
Overview
- All of the derivations thus far have been for mean squared error
- Cross-entropy loss is more appropriate for classification problems
- References and suggested reading:
- Scikit-learn book: Chapter 4, training models
- Scikit-learn docs: Log loss
- Deep Learning Book: Sections 3.1, 3.8, and 6.2
Revisiting the expected value
The expected value of some function $f(x)$ when $x$ is distributed as $p(x)$ is given in discrete form as:

$$\mathbb{E}[f(x)] = \sum_x p(x) f(x)$$

where the sum is over all possible values of $x$.

In continuous form, this is an integral:

$$\mathbb{E}[f(x)] = \int p(x) f(x) \, dx$$
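As a quick numeric sketch of the discrete form, here is the expected value of $f(x) = x^2$ for a fair six-sided die (the die example is mine, not from the slides):

```python
import numpy as np

# E[f(x)] = sum over x of p(x) * f(x), with f(x) = x^2
x = np.arange(1, 7)        # possible outcomes 1..6
p = np.full(6, 1 / 6)      # uniform probabilities for a fair die
expected = np.sum(p * x**2)
print(expected)            # 91/6 ≈ 15.1667
```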
Binary case: Bernoulli distribution
- If a random variable $x$ has a probability $p$ of being 1 and a probability $1 - p$ of being 0, then $x$ is distributed as a Bernoulli distribution:

  $$P(x) = p^x (1 - p)^{1 - x}$$

- The expected value of $x$ is then:

  $$\mathbb{E}[x] = 1 \cdot p + 0 \cdot (1 - p) = p$$
Information theory
Originally developed for message communication, with the intuition that less likely events carry more information. For a single event, the information is defined as:

$$I(x) = -\log p(x)$$
Entropy

- We can measure the expected information of a distribution as:

  $$H(x) = \mathbb{E}[I(x)] = -\sum_x p(x) \log p(x)$$

- This is called the Shannon entropy
- Measured in bits (base 2) or nats (base $e$)
- :abacus: Find the entropy of a Bernoulli distribution
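A small sketch of the exercise above, assuming the standard closed form $H = -p\log p - (1-p)\log(1-p)$ for a Bernoulli distribution:

```python
import numpy as np

def bernoulli_entropy(p, base=2):
    """Shannon entropy of a Bernoulli(p) distribution, in bits by default."""
    if p in (0, 1):
        return 0.0  # a certain outcome carries no information
    return -(p * np.log(p) + (1 - p) * np.log(1 - p)) / np.log(base)

print(bernoulli_entropy(0.5))  # 1.0 bit: a fair coin is maximally uncertain
print(bernoulli_entropy(0.9))  # ≈ 0.469 bits: a biased coin is more predictable
```

The entropy peaks at $p = 0.5$ and falls to zero as the outcome becomes certain.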
Cross-entropy
- The KL divergence is a measure of the extra information needed to encode a message from a true distribution $p$ using an approximate distribution $q$:

  $$D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = \sum_x p(x) \log p(x) - \sum_x p(x) \log q(x)$$

- The cross-entropy is a simplification that drops the $\sum_x p(x) \log p(x)$ term:

  $$H(p, q) = -\sum_x p(x) \log q(x)$$

- Minimizing the cross-entropy is equivalent to minimizing the KL divergence
- If $p = q$, then $D_{KL}(p \| q) = 0$ and $H(p, q) = H(p)$
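A numeric sanity check of the identity $H(p, q) = H(p) + D_{KL}(p \| q)$ on a made-up three-outcome distribution (the specific numbers are illustrative):

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])  # "true" distribution
q = np.array([0.5, 0.3, 0.2])  # approximate distribution

# Cross-entropy decomposes as entropy plus the KL divergence
print(np.isclose(cross_entropy(p, q), entropy(p) + kl_divergence(p, q)))  # True
print(np.isclose(kl_divergence(p, p), 0.0))  # True: no extra bits when q = p
```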
Cross-entropy loss

For a true label $y \in \{0, 1\}$ and predicted probability $\hat{y}$, the cross-entropy loss is:

$$\ell(y, \hat{y}) = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$$

where $\hat{y}$ is the output of the final layer of a neural network (thresholded to obtain the predicted class)
This is also called log loss or binary cross-entropy
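A minimal sketch of the loss above (a hand-rolled version, not scikit-learn's `log_loss`, though it computes the same quantity). The `eps` clipping is a common numerical guard against $\log 0$:

```python
import numpy as np

def log_loss(y, y_hat, eps=1e-12):
    """Mean binary cross-entropy; eps avoids log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1, 1])
good = np.array([0.9, 0.1, 0.8, 0.7])  # confident and mostly correct
bad = np.array([0.4, 0.6, 0.3, 0.2])   # mostly wrong

print(log_loss(y, good))  # small loss
print(log_loss(y, bad))   # much larger loss
```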
Terminology for evaluation
- True positive: predicted positive, label was positive ($\hat{y} = 1$, $y = 1$) ✔️
- True negative: predicted negative, label was negative ($\hat{y} = 0$, $y = 0$) ✔️
- False positive: predicted positive, label was negative ($\hat{y} = 1$, $y = 0$) ❌ (type I)
- False negative: predicted negative, label was positive ($\hat{y} = 0$, $y = 1$) ❌ (type II)
- Accuracy is the fraction of correct predictions, given as:

  $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
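A short sketch counting the four outcomes and computing accuracy on a small invented example:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # predicted 1, label 1
tn = np.sum((y_pred == 0) & (y_true == 0))  # predicted 0, label 0
fp = np.sum((y_pred == 1) & (y_true == 0))  # type I error
fn = np.sum((y_pred == 0) & (y_true == 1))  # type II error

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.75: 6 of 8 predictions correct
```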
Precision and recall
- Precision: Out of all the positive predictions, how many were correct?

  $$\text{Precision} = \frac{TP}{TP + FP}$$

- Recall: Out of all the positive labels, how many were correctly predicted?

  $$\text{Recall} = \frac{TP}{TP + FN}$$

- Specificity: Out of all the negative labels, how many were correctly predicted?

  $$\text{Specificity} = \frac{TN}{TN + FP}$$
Confusion matrix
| | Predicted Positive | Predicted Negative |
|---|---|---|
| **Actual Positive** | TP | FN |
| **Actual Negative** | FP | TN |

- The axes might be reversed, but a good predictor will have strong diagonals
- There's also the F1 score, or harmonic mean of precision and recall:

  $$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
ROC Curves
- The receiver operating characteristic curve is a plot of the true positive rate (recall or sensitivity) vs. false positive rate (1 - specificity) as the detection threshold changes
- The diagonal is the same as random guessing
- A perfect classifier would hug the top left corner
Fun fact: the name comes from WWII radar operators, where true positives were airplanes and false positives were noise
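The threshold sweep behind an ROC curve can be sketched directly; `roc_points` is a hypothetical helper (real libraries like scikit-learn provide `roc_curve` for this):

```python
import numpy as np

def roc_points(y_true, scores, thresholds):
    """True positive rate and false positive rate at each detection threshold."""
    tpr, fpr = [], []
    for t in thresholds:
        y_pred = (scores >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        tpr.append(tp / (tp + fn))
        fpr.append(fp / (fp + tn))
    return np.array(fpr), np.array(tpr)

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])  # predicted probabilities
fpr, tpr = roc_points(y_true, scores, thresholds=np.linspace(0, 1, 11))
# At threshold 0 everything is predicted positive (TPR = FPR = 1);
# above the largest score everything is negative (TPR = FPR = 0)
```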
Which classifier is better?

Multiclass case
- For $K$ classes, the output is a vector $\hat{\mathbf{y}}$ with $\sum_{k=1}^{K} \hat{y}_k = 1$
- The cross-entropy loss is then:

  $$\ell(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$

- For a one-hot encoded vector $\mathbf{y}$, this simplifies to:

  $$\ell(\mathbf{y}, \hat{\mathbf{y}}) = -\log \hat{y}_c$$

  where $c$ is the index of the true class
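A sketch verifying the one-hot simplification numerically (the sample labels and predictions are invented):

```python
import numpy as np

def categorical_cross_entropy(y, y_hat, eps=1e-12):
    """-sum_k y_k log(y_hat_k), averaged over samples."""
    y_hat = np.clip(y_hat, eps, 1.0)
    return -np.mean(np.sum(y * np.log(y_hat), axis=1))

# One-hot labels for 3 classes; each row of y_hat sums to 1
y = np.array([[0, 1, 0],
              [1, 0, 0]])
y_hat = np.array([[0.1, 0.8, 0.1],
                  [0.7, 0.2, 0.1]])

# With one-hot labels, only the true-class probability contributes
loss = categorical_cross_entropy(y, y_hat)
shortcut = -np.mean(np.log(y_hat[np.arange(2), np.argmax(y, axis=1)]))
print(np.isclose(loss, shortcut))  # True
```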
The softmax function
- For binary classification, the sigmoid function is used to predict the probability of the positive class
- For multiclass classification, the softmax function is used:

  $$\hat{y}_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$

  where $z_k$ is the output of neuron $k$ in the final layer before the activation function is applied

- This means that $K$ neurons are needed in the final layer, one for each class
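A minimal softmax sketch; subtracting the maximum logit before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, computed in a numerically stable way."""
    z = z - np.max(z, axis=-1, keepdims=True)  # shift logits; result unchanged
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

z = np.array([2.0, 1.0, 0.1])  # raw outputs z_k of the K = 3 final-layer neurons
p = softmax(z)
print(p)           # the largest logit gets the largest probability
print(np.sum(p))   # 1.0: a valid probability distribution
```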