If a perceptron can't even solve XOR, how can it do higher order logic?
Consider that XOR can be rewritten as:
A XOR B = (A AND NOT B) OR (NOT A AND B)
A perceptron can solve AND, OR, and NOT... so what if the inputs to an OR perceptron are the outputs of two AND perceptrons?
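With fixed weights, that stacking can be sketched directly (the weights and the 0.5 firing threshold below are one workable choice, not the only one):

```python
# Sketch: XOR built from three perceptrons with hand-picked weights.
# step() is the original perceptron activation: fire iff the weighted
# sum of the inputs exceeds the threshold.

def step(x, threshold=0.5):
    return 1 if x > threshold else 0

def perceptron(inputs, weights):
    return step(sum(w * i for w, i in zip(weights, inputs)))

def xor(a, b):
    # (A AND NOT B): weights (1, -1) exceed 0.5 only for A=1, B=0
    h1 = perceptron([a, b], [1, -1])
    # (NOT A AND B): weights (-1, 1) exceed 0.5 only for A=0, B=1
    h2 = perceptron([a, b], [-1, 1])
    # OR of the two hidden units: fires if either hidden unit fired
    return perceptron([h1, h2], [1, 1])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor(a, b))
```

No single perceptron can draw a line separating XOR's outputs, but the two hidden units carve the plane into pieces the output unit can combine.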

1. Initialize the weights, through some random-ish strategy
2. Perform a forward pass to compute the output of each neuron
3. Compute the loss at the output layer (e.g. MSE)
4. Calculate the gradient of the loss with respect to each weight
5. Update the weights using gradient descent (minibatch, stochastic, etc.)
6. Repeat steps 2-5 until a stopping criterion is met

Step 4 is the "backpropagation" part
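The loop above can be sketched end-to-end in NumPy on the XOR dataset. A purely linear network cannot fit XOR, so this sketch assumes sigmoid activations; the layer sizes, learning rate, and iteration count are illustrative choices too:

```python
# A minimal sketch of the training loop: 2 inputs -> 2 hidden -> 1 output,
# sigmoid activations, full-batch gradient descent on MSE.
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: initialize the weights with a random-ish strategy
W1 = rng.normal(size=(2, 2)); b1 = np.zeros(2)
W2 = rng.normal(size=(2, 1)); b2 = np.zeros(1)

lr = 1.0
losses = []
for _ in range(5000):
    # Step 2: forward pass
    h = sigmoid(X @ W1 + b1)          # hidden activations
    y_hat = sigmoid(h @ W2 + b2)      # network output
    # Step 3: loss (MSE)
    losses.append(np.mean((y_hat - y) ** 2))
    # Step 4: backpropagation (chain rule; constant factors folded into lr)
    d2 = (y_hat - y) * y_hat * (1 - y_hat)   # output-layer delta
    d1 = (d2 @ W2.T) * h * (1 - h)           # hidden-layer delta
    # Step 5: gradient-descent update
    W2 -= lr * h.T @ d2;  b2 -= lr * d2.sum(axis=0)
    W1 -= lr * X.T @ d1;  b1 -= lr * d1.sum(axis=0)

print("loss:", losses[0], "->", losses[-1])
```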
For the simple two-layer network (identity activations for now), the forward pass is

$$h = W_1 x + b_1 \quad \text{and} \quad \hat{y} = W_2 h + b_2$$

with loss $L = \frac{1}{2}\|\hat{y} - y\|^2$

Weights in the second layer (connecting hidden and output):

$$\frac{\partial L}{\partial W_2} = \delta_2\, h^\top$$

where $\delta_2 = \hat{y} - y$

For the first layer (connecting inputs to hidden):

$$\frac{\partial L}{\partial W_1} = \delta_1\, x^\top$$

where $\delta_1 = W_2^\top \delta_2$
| Parameter | Gradient |
|---|---|
| Weights of layer 2 | $\frac{\partial L}{\partial W_2} = (\hat{y} - y)\, h^\top$ |
| Bias of layer 2 | $\frac{\partial L}{\partial b_2} = \hat{y} - y$ |
| Weights of layer 1 | $\frac{\partial L}{\partial W_1} = W_2^\top (\hat{y} - y)\, x^\top$ |
| Bias of layer 1 | $\frac{\partial L}{\partial b_1} = W_2^\top (\hat{y} - y)$ |

($x$ is the input, $h$ the hidden activations, $\hat{y}$ the prediction)
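As a sanity check, the table's formulas can be computed for the identity-activation network and compared against a finite-difference estimate (the input, target, and weight values below are arbitrary examples):

```python
# Check the table's gradients numerically for the linear (identity) network.
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([0.5])
W1 = np.array([[0.1, -0.2], [0.3, 0.4]]); b1 = np.array([0.05, -0.05])
W2 = np.array([[0.2, -0.1]]);             b2 = np.array([0.1])

def forward(W1, b1, W2, b2):
    h = W1 @ x + b1              # hidden layer (identity activation)
    y_hat = W2 @ h + b2          # output layer
    return h, y_hat, 0.5 * np.sum((y_hat - y) ** 2)

h, y_hat, L = forward(W1, b1, W2, b2)

# Gradients from the table
d2 = y_hat - y                   # delta at the output
d1 = W2.T @ d2                   # delta backpropagated to the hidden layer
dW2, db2 = np.outer(d2, h), d2
dW1, db1 = np.outer(d1, x), d1

# Finite-difference check on one entry of W1: nudge it, re-run the loss
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
numeric = (forward(W1p, b1, W2, b2)[2] - L) / eps
print(dW1[0, 0], numeric)        # the two should agree closely
```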
Many of the terms computed in the forward pass are reused in the backward pass (such as the inputs to each layer)
Similarly, gradients computed in one layer are reused when computing the gradients of the previous layer
Typically each intermediate value is stored during the forward pass, but modern networks are big, so the memory cost adds up:
| Model | Parameters |
|---|---|
| Our example | 6 |
| AlexNet (2012) | 60 million |
| GPT-3 (2020) | 175 billion |
The simple example used a linear activation function (identity)
To include other activation functions $f$, the forward pass becomes:

$$h = f(W_1 x + b_1) \qquad \hat{y} = f(W_2 h + b_2)$$

The gradient in the output layer becomes:

$$\delta_2 = (\hat{y} - y) \odot f'(z_2)$$

where $z_2 = W_2 h + b_2$ is the pre-activation of the output layer and $f'$ is the derivative of the activation function
Problem! That step function in the original perceptron is not differentiable
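One common fix, sketched below as an assumption rather than something from the example above, is to swap the step for a smooth function such as the sigmoid, whose derivative exists everywhere and has the convenient closed form $\sigma(x)(1 - \sigma(x))$:

```python
# Sigmoid and its derivative, with a finite-difference spot check.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)    # closed-form derivative

# Central-difference estimate of the slope at an arbitrary point
x, eps = 0.7, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(sigmoid_prime(x), numeric)   # should agree to many decimal places
```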
> The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles.
>
> -- Deep Learning Book, Section 6.3
Draw the functions and derivatives on the board