Backpropagation Deep Dive
The Central Question
You have a neural network with millions of parameters. The network makes a prediction. The prediction is wrong. How do you know which of those millions of parameters to adjust, and by how much? This is the problem backpropagation solves. It's the algorithm that makes deep learning possible.
Historical Context: Backpropagation was independently discovered multiple times, but Rumelhart, Hinton, and Williams' 1986 paper popularized it. Even so, deep networks trained with backpropagation did not take off until around 2012 because:
- Computers weren’t fast enough
- Data wasn’t available
- Techniques to train deep networks weren’t developed
The Core Insight: Credit Assignment
Imagine a factory assembly line: when the finished product comes out flawed, you have to work backward through the stations to figure out how much each one contributed to the defect. Backpropagation performs the same kind of credit assignment, attributing a share of the final error to every parameter in the network.
Computational Graphs
To understand backpropagation, we first need to see neural networks as computational graphs.
A Simple Example
Let's compute $f(x, y, z) = (x + y) \cdot z$. Introduce the intermediate variable $q = x + y$, so that $f = q \cdot z$.
Backward Pass with Chain Rule
To compute $\frac{\partial f}{\partial x}$, we apply the chain rule: $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x}$. Let's trace through:
- $f = q \cdot z$, so $\frac{\partial f}{\partial q} = z$
- $q = x + y$, so $\frac{\partial q}{\partial x} = 1$
- Therefore: $\frac{\partial f}{\partial x} = z \cdot 1 = z$
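To make this concrete, here is a minimal sketch that checks the hand-derived gradients with PyTorch autograd; the values $x = -2$, $y = 5$, $z = -4$ are chosen here purely for illustration:

```python
import torch

# Example values (chosen arbitrarily for illustration)
x = torch.tensor(-2.0, requires_grad=True)
y = torch.tensor(5.0, requires_grad=True)
z = torch.tensor(-4.0, requires_grad=True)

q = x + y          # intermediate node: q = x + y
f = q * z          # output node:       f = q * z

f.backward()       # backward pass applies the chain rule automatically

print(x.grad)      # df/dx = z         -> -4
print(y.grad)      # df/dy = z         -> -4
print(z.grad)      # df/dz = q = x + y ->  3
```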
The Chain Rule: The Key to Everything
The chain rule states: if $z$ depends on $y$ and $y$ depends on $x$, then $\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$. Or for a chain of functions $y = f(g(h(x)))$: $\frac{dy}{dx} = f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x)$. This is the ONLY math you need for backpropagation!
Intuition
If $y$ increases by 2 when $x$ increases by 1 ($\frac{dy}{dx} = 2$), and $z$ increases by 3 when $y$ increases by 1 ($\frac{dz}{dy} = 3$):
- Then $z$ should increase by $3 \times 2 = 6$ when $x$ increases by 1 ($\frac{dz}{dx} = 6$)
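A quick way to sanity-check this intuition is to compose two linear functions with those rates and let autograd multiply them for us; a tiny sketch:

```python
import torch

x = torch.tensor(1.0, requires_grad=True)
y = 2 * x          # dy/dx = 2
z = 3 * y          # dz/dy = 3

z.backward()
print(x.grad)      # dz/dx = dz/dy * dy/dx = 3 * 2 = 6
```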
Backpropagation in a Neural Network
Now let's apply this to an actual neural network layer.
Single Neuron
A single neuron computes $a = \sigma(z)$ with $z = w^\top x + b$, where $\sigma$ is the activation function (e.g., sigmoid). Forward pass:
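A minimal sketch, assuming a sigmoid activation and NumPy; the backward helper is included alongside it since the cached intermediates are exactly what the backward pass needs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(x, w, b):
    """Forward pass of a single neuron: a = sigmoid(w . x + b)."""
    z = np.dot(w, x) + b          # pre-activation
    a = sigmoid(z)                # activation
    cache = (x, w, z, a)          # keep intermediates for the backward pass
    return a, cache

def neuron_backward(da, cache):
    """Backward pass: given dL/da, return dL/dw, dL/db, dL/dx."""
    x, w, z, a = cache
    dz = da * a * (1 - a)         # sigmoid'(z) = a * (1 - a)
    dw = dz * x                   # dL/dw
    db = dz                       # dL/db
    dx = dz * w                   # dL/dx (passed to the previous layer)
    return dw, db, dx
```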
Full Backpropagation Algorithm
For a multi-layer network, backpropagation works as follows:
Algorithm
1. Forward pass: for each layer $l$, compute $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$ and $a^{(l)} = \sigma(z^{(l)})$, caching the intermediates.
2. Output error: compute $\delta^{(L)} = \nabla_{a^{(L)}} \mathcal{L} \odot \sigma'(z^{(L)})$.
3. Backward pass: for $l = L-1, \dots, 1$, compute $\delta^{(l)} = \big(W^{(l+1)\top} \delta^{(l+1)}\big) \odot \sigma'(z^{(l)})$.
4. Gradients: $\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} a^{(l-1)\top}$ and $\frac{\partial \mathcal{L}}{\partial b^{(l)}} = \delta^{(l)}$.
Implementation from Scratch
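Here is a minimal from-scratch sketch of the algorithm above: a two-layer network with sigmoid activations and a squared-error loss, written in NumPy. The layer sizes and training data are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TwoLayerNet:
    """A tiny two-layer network with hand-written backprop (illustrative only)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (n_hidden, n_in))
        self.b1 = np.zeros((n_hidden, 1))
        self.W2 = rng.normal(0, 0.1, (n_out, n_hidden))
        self.b2 = np.zeros((n_out, 1))

    def forward(self, x):
        # Step 1: forward pass, caching intermediates
        self.x = x
        self.z1 = self.W1 @ x + self.b1
        self.a1 = sigmoid(self.z1)
        self.z2 = self.W2 @ self.a1 + self.b2
        self.a2 = sigmoid(self.z2)
        return self.a2

    def backward(self, y):
        # Step 2: output error for L = 0.5 * ||a2 - y||^2
        delta2 = (self.a2 - y) * self.a2 * (1 - self.a2)
        # Step 4 (output layer): gradients
        dW2 = delta2 @ self.a1.T
        db2 = delta2
        # Step 3: propagate the error to the hidden layer
        delta1 = (self.W2.T @ delta2) * self.a1 * (1 - self.a1)
        dW1 = delta1 @ self.x.T
        db1 = delta1
        return dW1, db1, dW2, db2

    def step(self, grads, lr=0.1):
        dW1, db1, dW2, db2 = grads
        self.W1 -= lr * dW1; self.b1 -= lr * db1
        self.W2 -= lr * dW2; self.b2 -= lr * db2

# Toy usage on a single made-up example
net = TwoLayerNet(n_in=3, n_hidden=4, n_out=1)
x = np.array([[0.5], [-1.0], [2.0]])
y = np.array([[1.0]])
for _ in range(100):
    net.forward(x)
    net.step(net.backward(y))
print(net.forward(x))   # prediction should move toward 1.0
```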
Gradient Checking
How do you know your gradients are correct? Numerical gradient checking: compare each analytic gradient against a centered finite-difference estimate, $\frac{\partial \mathcal{L}}{\partial \theta} \approx \frac{\mathcal{L}(\theta + \epsilon) - \mathcal{L}(\theta - \epsilon)}{2\epsilon}$.
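A minimal sketch of the idea; the helper names and the tolerance of roughly `1e-5` are conventional choices, not hard rules:

```python
import numpy as np

def numerical_gradient(loss_fn, theta, eps=1e-5):
    """Centered finite-difference estimate of dL/dtheta for a parameter array."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        old = theta.flat[i]
        theta.flat[i] = old + eps
        loss_plus = loss_fn()
        theta.flat[i] = old - eps
        loss_minus = loss_fn()
        theta.flat[i] = old               # restore the parameter
        grad.flat[i] = (loss_plus - loss_minus) / (2 * eps)
    return grad

def relative_error(analytic, numeric):
    """Relative error; values below ~1e-5 usually indicate correct gradients."""
    return np.abs(analytic - numeric) / np.maximum(
        1e-8, np.abs(analytic) + np.abs(numeric))
```

For example, you could wrap the two-layer network's loss in a closure and compare `numerical_gradient` on `W1` against the `dW1` returned by its backward pass.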
Visualizing Gradient Flow
A useful diagnostic is to run one backward pass and record the gradient norm of each layer's weights; if the norms shrink sharply toward the early layers, gradients are vanishing.
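A minimal sketch, assuming a deep `nn.Sequential` stack of linear layers with sigmoid activations (the depth and widths are arbitrary):

```python
import torch
import torch.nn as nn

# Build an arbitrarily deep stack of Linear + Sigmoid layers
layers = []
for _ in range(20):
    layers += [nn.Linear(64, 64), nn.Sigmoid()]
model = nn.Sequential(*layers, nn.Linear(64, 1))

x = torch.randn(32, 64)
loss = model(x).mean()
loss.backward()

# Print the gradient norm of each Linear layer's weight, from input to output
for name, p in model.named_parameters():
    if name.endswith("weight"):
        print(f"{name:20s} grad norm = {p.grad.norm():.2e}")
```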
The Vanishing Gradient Problem
With sigmoid activation:
- Derivative: $\sigma'(z) = \sigma(z)\,(1 - \sigma(z)) \le 0.25$
- After 10 layers: the gradient is scaled by at most $0.25^{10} \approx 10^{-6}$
Solutions:
- ReLU activation: Gradient is 1 for positive inputs
- Batch normalization: Keeps activations in a good range
- Residual connections: Skip connections give gradients a direct additive path to earlier layers
- Better initialization: He/Xavier initialization
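To see the effect directly, here is a small sketch comparing the gradient reaching the input after 10 stacked layers with sigmoid versus ReLU activations (width and depth are arbitrary):

```python
import torch
import torch.nn as nn

def input_grad_norm(activation, depth=10, width=64):
    """Norm of dLoss/dInput after `depth` Linear + activation layers."""
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), activation()]
    model = nn.Sequential(*layers)

    x = torch.randn(8, width, requires_grad=True)
    model(x).sum().backward()
    return x.grad.norm().item()

print("sigmoid:", input_grad_norm(nn.Sigmoid))   # tiny
print("relu:   ", input_grad_norm(nn.ReLU))      # orders of magnitude larger
```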
PyTorch Autograd
PyTorch handles all of this automatically:
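For example, here is a sketch of the single-neuron gradient from earlier computed with autograd instead of by hand (the input and weight values are arbitrary):

```python
import torch

x = torch.tensor([0.5, -1.0, 2.0])
w = torch.tensor([0.1, 0.2, 0.3], requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

a = torch.sigmoid(w @ x + b)   # forward pass builds the computational graph
a.backward()                   # backward pass fills the .grad attributes

print(w.grad)                  # da/dw
print(b.grad)                  # da/db
```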
How Autograd Works
PyTorch builds a computational graph during the forward pass: every operation on tensors that require gradients records a `grad_fn` node, and calling `.backward()` walks these nodes in reverse, applying the chain rule at each one.
Visualizing the Computation Graph
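You can peek at this graph through each tensor's `grad_fn` attribute and its `next_functions`; a small sketch using the earlier $(x + y) \cdot z$ example:

```python
import torch

x = torch.tensor(-2.0, requires_grad=True)
y = torch.tensor(5.0, requires_grad=True)
z = torch.tensor(-4.0, requires_grad=True)

f = (x + y) * z

print(f.grad_fn)                    # MulBackward0: the last op in the graph
for fn, _ in f.grad_fn.next_functions:
    print(" ", fn)                  # AddBackward0 (for x + y), AccumulateGrad (for z)
```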
Common Backprop Patterns
Pattern 1: Add Gate
Gradient distributor: Passes gradient unchanged to both inputs.
Pattern 2: Multiply Gate
Gradient switcher: Gradient to $x$ is scaled by $y$, and vice versa.
Pattern 3: Max Gate
Gradient router: Gradient flows only to the larger input.
Pattern 4: ReLU
Gradient gate: Passes gradient if input was positive, blocks if negative.
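A small sketch verifying all four patterns with autograd (the input values are chosen arbitrarily):

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(-4.0, requires_grad=True)

# Add gate: distributes the gradient unchanged
(x + y).backward()
print(x.grad, y.grad)              # 1, 1

# Multiply gate: each input receives the other's value
x.grad = y.grad = None
(x * y).backward()
print(x.grad, y.grad)              # y = -4, x = 3

# Max gate: routes the gradient to the larger input
x.grad = y.grad = None
torch.maximum(x, y).backward()
print(x.grad, y.grad)              # 1, 0

# ReLU: gates the gradient on the sign of the input
x.grad = y.grad = None
(torch.relu(x) + torch.relu(y)).backward()
print(x.grad, y.grad)              # 1 (x > 0), 0 (y < 0)
```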
Exercises
Exercise 1: Manual Backprop
Compute the gradients by hand for this simple network on a given input, using specific numeric values for the inputs, weights, and bias. Verify your answer with PyTorch.
Exercise 2: Custom Autograd Function
Implement a custom activation function with PyTorch autograd. Test that it gives the same results as `nn.ReLU()`.
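One possible skeleton, using `torch.autograd.Function`; the class name `MyReLU` is just a placeholder:

```python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)          # cache the input for the backward pass
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass the gradient only where the input was positive
        return grad_output * (x > 0).to(grad_output.dtype)

# Usage: compare against nn.ReLU()
x = torch.randn(5, requires_grad=True)
y = MyReLU.apply(x)
```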
Exercise 3: Gradient Flow Analysis
Create a 50-layer network with:
- Sigmoid activations
- ReLU activations
- Tanh activations
For each variant, run a forward and backward pass and compare the gradient norms at the early layers.
Exercise 4: Implement Batch Norm Backward
Implement the backward pass for batch normalization. Forward: $\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$, $y_i = \gamma \hat{x}_i + \beta$, where $\mu$ and $\sigma^2$ are the batch mean and variance. The backward pass involves computing $\frac{\partial \mathcal{L}}{\partial \gamma}$, $\frac{\partial \mathcal{L}}{\partial \beta}$, and $\frac{\partial \mathcal{L}}{\partial x}$.
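As a starting point, here is a sketch of the forward pass that caches what the backward pass will need (per-feature statistics over the batch dimension; the backward pass itself is left as the exercise):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, D) batch. Returns the output and a cache for the backward pass."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized input
    out = gamma * x_hat + beta             # scale and shift
    cache = (x, x_hat, mu, var, gamma, eps)
    return out, cache
```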
Key Takeaways
| Concept | Key Insight |
|---|---|
| Chain Rule | Multiply gradients along paths |
| Computational Graph | Break complex functions into simple operations |
| Forward Pass | Compute outputs, cache intermediates |
| Backward Pass | Compute gradients from output to input |
| Gradient Checking | Verify with numerical gradients |
| Vanishing Gradients | Sigmoid/tanh → use ReLU |