Consider the 1-d case: we want the values of the slope m and intercept b in y = mx + b that best fit the data.

Math time!
After some algebraic gymnastics, we get:

m = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b = \bar{y} - m\bar{x}

where \bar{x} and \bar{y} are the means of the x and y values.
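The closed-form 1-d fit above can be computed directly. A minimal sketch with made-up toy data (the x and y values here are illustrative, not from the lesson):

```python
import numpy as np

# Hypothetical 1-d data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

# Closed-form least-squares fit for y ≈ m*x + b
m_hat = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b_hat = y.mean() - m_hat * x.mean()
print(m_hat, b_hat)
```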
Instead of the scalar x, we use the design matrix X,
where each row is an instance (sample) and each column is a feature.
The first column is all ones, representing the bias term.
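Building such a design matrix takes one line in NumPy. A sketch with hypothetical data (5 samples, 2 features):

```python
import numpy as np

# Hypothetical raw data: 5 instances, 2 features
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0],
                  [5.0, 2.5]])

# Prepend a column of ones so the first parameter acts as the bias term
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
print(X.shape)  # (5, 3): 5 instances, 1 bias column + 2 feature columns
```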
We can rewrite the estimate in matrix notation:

\hat{y} = X\theta

where \theta is the parameter vector (the bias plus one weight per feature).
The MSE can be written as:

\mathrm{MSE}(\theta) = \frac{1}{m}(X\theta - y)^T(X\theta - y)

where we've used the trick of substituting \hat{y} = X\theta.
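A quick numerical check that the matrix form of the MSE matches the familiar per-sample average of squared errors (the data here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.hstack([np.ones((10, 1)), rng.normal(size=(10, 2))])  # design matrix with bias column
y = rng.normal(size=10)
theta = rng.normal(size=3)

m = X.shape[0]
residual = X @ theta - y
mse_matrix = (residual @ residual) / m  # (1/m)(Xθ - y)^T (Xθ - y)
mse_loop = sum((X[i] @ theta - y[i]) ** 2 for i in range(m)) / m  # per-sample average
print(np.isclose(mse_matrix, mse_loop))  # True
```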
Find the gradient of the MSE w.r.t. \theta, set it to zero, and solve for \theta.
The following properties are useful for solving linear algebra problems:

\nabla_\theta\, a^T\theta = a, \qquad \nabla_\theta\, \theta^T A \theta = 2A\theta \ \text{(for symmetric } A\text{)}, \qquad (AB)^T = B^T A^T

Additionally, any matrix or vector multiplied by the identity matrix I is unchanged.
We made it! The Normal Equation is again:

\hat{\theta} = (X^T X)^{-1} X^T y
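The Normal Equation can be sketched in a few lines of NumPy. The synthetic data below are generated from assumed true parameters (bias 4, slope 3) purely to check that the formula recovers them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from assumed parameters: y = 4 + 3x + noise
m_samples = 100
x = rng.uniform(0, 10, size=(m_samples, 1))
y = 4.0 + 3.0 * x[:, 0] + rng.normal(0, 0.5, size=m_samples)

X = np.hstack([np.ones((m_samples, 1)), x])  # bias column of ones

# Normal Equation: theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)  # approximately [4, 3]
```

In practice `np.linalg.lstsq` (or a pseudo-inverse) is preferred over an explicit inverse, since X^T X can be ill-conditioned or singular.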
The goal of gradient descent is still to minimize the cost function, but it follows an iterative process:
1. Start with a random \theta
2. Calculate the gradient \nabla_\theta \mathrm{MSE}(\theta)
3. Update \theta \leftarrow \theta - \eta \nabla_\theta \mathrm{MSE}(\theta)
4. Repeat 2-3 until some stopping criterion is met

where \eta is the learning rate (step size).
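The iterative process above can be sketched for the MSE, whose gradient is (2/m) X^T (X\theta - y). The data, learning rate, and step count here are illustrative choices, not prescribed values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from assumed parameters: y = 4 + 3x + noise
m_samples = 100
x = rng.uniform(0, 2, size=(m_samples, 1))
y = 4.0 + 3.0 * x[:, 0] + rng.normal(0, 0.1, size=m_samples)
X = np.hstack([np.ones((m_samples, 1)), x])  # bias column of ones

eta = 0.1                     # learning rate (illustrative choice)
theta = rng.normal(size=2)    # step 1: start with a random theta

for _ in range(1000):         # step 4: repeat (fixed step count as the stopping criterion)
    gradient = (2 / m_samples) * X.T @ (X @ theta - y)  # step 2: gradient of the MSE
    theta = theta - eta * gradient                       # step 3: update
print(theta)  # approximately [4, 3]
```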
The loss function is the function being minimized by gradient descent
MSE is convex and guaranteed to have a single global minimum, but many other loss functions have multiple local minima
The relative scale of the features can affect the convergence: features on very different scales stretch the cost surface into an elongated bowl, so gradient descent needs many more steps to reach the minimum.

Note: the term cost function is often used instead of loss function
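One common remedy for badly scaled features is standardization (zero mean, unit variance per column) before running gradient descent. A minimal sketch, with hypothetical features on very different scales:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical features on very different scales
X = np.hstack([rng.uniform(0, 1, size=(100, 1)),      # feature 1: ~[0, 1]
               rng.uniform(0, 1000, size=(100, 1))])  # feature 2: ~[0, 1000]

# Standardize each column to zero mean and unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0))  # ~[0, 0] and [1, 1]
```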
Logistic regression is a binary classifier that uses the logistic function (aka sigmoid function) to map the output to a range of 0 to 1:

\sigma(t) = \frac{1}{1 + e^{-t}}, \qquad \hat{p} = \sigma(\theta^T x)
We can then minimize the log loss or cross-entropy loss function:

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\right]

where \hat{p}_i = \sigma(\theta^T x_i) is the predicted probability for instance i.
The gradient of the log loss ends up being:

\nabla_\theta J(\theta) = \frac{1}{m} X^T\!\left(\sigma(X\theta) - y\right)
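Putting the sigmoid, the log-loss gradient, and gradient descent together gives a small logistic-regression sketch. The labels below are synthetic, drawn from an assumed model with bias 1 and weight 3; the learning rate and step count are illustrative:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(3)
m_samples = 200
x = rng.normal(size=(m_samples, 1))

# Synthetic binary labels from an assumed model: p = sigmoid(1 + 3x)
p_true = sigmoid(1.0 + 3.0 * x[:, 0])
y = (rng.uniform(size=m_samples) < p_true).astype(float)

X = np.hstack([np.ones((m_samples, 1)), x])  # bias column of ones
theta = np.zeros(2)
eta = 0.5  # learning rate (illustrative choice)

for _ in range(2000):
    grad = (1 / m_samples) * X.T @ (sigmoid(X @ theta) - y)  # gradient of the log loss
    theta -= eta * grad                                       # gradient descent update
print(theta)
```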
y = mx + b
What is the meaning of yhat being a vector?
Draw a gradient on the board and illustrate steps