At each layer, the gradient is computed from the gradient already obtained at the layer above, and so on, until we reach the first layer.
We are recursively applying the chain rule, re-using the gradients already computed at the layer above (the one closer to the output).
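To make the recursion concrete, here is a minimal NumPy sketch of a backward pass (not from the text; the fully-connected, sigmoid-activation setup and the names `Ws`, `zs`, `activations` are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backward(Ws, zs, activations, grad_output):
    """Propagate grad_output from the last layer back to the first,
    re-using the gradient computed at the layer above at every step.
    Ws: weight matrices per layer; zs: pre-activations; activations: [input, a_1, ..., a_L]."""
    grads_W = []
    delta = grad_output                                  # gradient w.r.t. the last layer's output
    for W, z, a_prev in zip(reversed(Ws), reversed(zs), reversed(activations[:-1])):
        delta = delta * sigmoid(z) * (1 - sigmoid(z))    # chain rule through the activation
        grads_W.append(np.outer(delta, a_prev))          # gradient w.r.t. this layer's weights
        delta = W.T @ delta                              # pass the gradient down to the layer below
    return list(reversed(grads_W))
```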
This is great for computational efficiency, but it can also lead to vanishing or exploding gradients: one multiplicative factor per layer accumulates as the gradient is propagated backward, so a long chain of small factors shrinks toward zero and a long chain of large factors blows up.
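A tiny numeric illustration (the factor values 0.25 and 1.5 are made up for the example, e.g. the maximum of the sigmoid derivative versus a weight norm above 1):

```python
import numpy as np

factors_small = np.full(50, 0.25)   # e.g. the sigmoid derivative never exceeds 0.25
factors_large = np.full(50, 1.5)    # e.g. layers whose weights amplify the signal

print(np.prod(factors_small))   # ~8e-31 -> vanishing gradient after 50 layers
print(np.prod(factors_large))   # ~6e+08 -> exploding gradient after 50 layers
```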
Note: this may not be a problem in practice, and ReLU is cheap to compute. Don't optimize prematurely unless you're seeing lots of "dead" neurons.
Ultimately, yet another hyperparameter to be tuned.
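As a hedged sketch (the leaky slope `alpha` below is one example of such a hyperparameter; the helper names are made up), this is one way to compare ReLU with leaky ReLU and to check what fraction of units look dead on a batch:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # alpha is the tunable slope for negative inputs
    return np.where(z > 0, z, alpha * z)

def dead_fraction(z):
    """Fraction of units that never activate across a batch of pre-activations
    z (shape: batch x units); columns that stay at 1.0 suggest dead ReLU neurons."""
    return np.mean(np.all(z <= 0, axis=0))

print(relu(np.array([-1.0, 2.0])), leaky_relu(np.array([-1.0, 2.0])))  # [0. 2.] vs [-0.01  2.]

z = np.random.randn(256, 128) - 4.0   # strongly negative shift -> nearly all units dead
print(dead_fraction(z))               # close to 1.0 for this batch
```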
From the Deep Learning Book, section 11.2: