
Backpropagation Deep Dive

The Central Question

You have a neural network with millions of parameters. The network makes a prediction. The prediction is wrong. How do you know which of those millions of parameters to adjust, and by how much? This is the problem backpropagation solves. It’s the algorithm that makes deep learning possible.
Historical Context: Backpropagation was independently discovered multiple times, but Rumelhart, Hinton, and Williams’ 1986 paper popularized it. Even so, deep networks trained with it did not take off until around 2012 because:
  1. Computers weren’t fast enough
  2. Large labeled datasets weren’t available
  3. Techniques for training deep networks hadn’t been developed
Today, it runs trillions of times per second in data centers worldwide.

The Core Insight: Credit Assignment

Imagine a factory assembly line:
Raw Materials → [Worker 1] → [Worker 2] → [Worker 3] → Defective Product
The final product is defective. Who’s at fault? You need to trace backward through the assembly line to find where things went wrong. That’s exactly what backpropagation does.
[Figure: The credit assignment problem]

Computational Graphs

To understand backpropagation, we first need to see neural networks as computational graphs.

A Simple Example

Let’s compute $f(x, y, z) = (x + y) \cdot z$:
    x   y
     \ /
     [+]
      |
      a   z
       \ /
       [*]
        |
        f
import numpy as np

# Forward pass
def forward(x, y, z):
    a = x + y      # Intermediate result
    f = a * z      # Final result
    return f, a    # Return intermediate for backprop

x, y, z = 2, 3, 4
f, a = forward(x, y, z)
print(f"f = ({x} + {y}) × {z} = {a} × {z} = {f}")

Backward Pass with Chain Rule

To compute $\frac{\partial f}{\partial x}$, we apply the chain rule:

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \cdot \frac{\partial a}{\partial x}$$

Let’s trace through:
  1. $f = a \cdot z$, so $\frac{\partial f}{\partial a} = z = 4$
  2. $a = x + y$, so $\frac{\partial a}{\partial x} = 1$
  3. Therefore: $\frac{\partial f}{\partial x} = 4 \cdot 1 = 4$
# Backward pass
def backward(z, a, grad_f=1.0):
    """
    grad_f: gradient of loss w.r.t. f (usually 1.0 if f is the loss)
    """
    # df/da = z (from f = a * z)
    grad_a = grad_f * z
    
    # df/dz = a (from f = a * z)
    grad_z = grad_f * a
    
    # da/dx = 1 (from a = x + y)
    grad_x = grad_a * 1
    
    # da/dy = 1 (from a = x + y)
    grad_y = grad_a * 1
    
    return grad_x, grad_y, grad_z

grad_x, grad_y, grad_z = backward(z, a)
print(f"∂f/∂x = {grad_x}, ∂f/∂y = {grad_y}, ∂f/∂z = {grad_z}")

The Chain Rule: The Key to Everything

The chain rule states:

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x}$$

Or for a chain of functions $f(g(h(x)))$:

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial x}$$

This is the ONLY math you need for backpropagation!

Intuition

If $g$ increases by 2 when $x$ increases by 1, and $f$ increases by 3 when $g$ increases by 1:
  • Then $f$ increases by $3 \times 2 = 6$ when $x$ increases by 1
The chain rule is about multiplying sensitivities along a path.
[Figure: Chain rule visualization]
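
To see this multiplication of sensitivities in action, here is a minimal sketch (the particular chain $h(x) = x^2$, $g = \exp$, $f = \sin$ is an arbitrary choice for illustration) comparing the chain-rule product against a central-difference estimate:

import numpy as np

# Chain f(g(h(x))) with h(x) = x^2, g(h) = e^h, f(g) = sin(g)
def composed(x):
    return np.sin(np.exp(x ** 2))

x = 0.5
eps = 1e-6

# Chain rule: multiply the three local sensitivities
# df/dx = cos(e^{x^2}) * e^{x^2} * 2x
analytic = np.cos(np.exp(x ** 2)) * np.exp(x ** 2) * 2 * x

# Central-difference estimate for comparison
numerical = (composed(x + eps) - composed(x - eps)) / (2 * eps)

print(f"analytic  = {analytic:.8f}")
print(f"numerical = {numerical:.8f}")

The two values should agree to several decimal places; this is the same idea gradient checking (below) applies to whole networks.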

Backpropagation in a Neural Network

Now let’s apply this to an actual neural network layer.

Single Neuron

A single neuron computes:

$$y = \sigma(wx + b)$$

where $\sigma$ is the activation function (e.g., sigmoid). Forward pass:
def neuron_forward(x, w, b):
    z = w * x + b          # Linear combination
    y = sigmoid(z)          # Activation
    return y, z             # Cache z for backward pass

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
Backward pass: We need $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial w}$$
def neuron_backward(x, z, grad_y):
    """
    grad_y: gradient of loss w.r.t. output y (∂L/∂y)
    """
    # ∂y/∂z: sigmoid derivative
    sig = sigmoid(z)
    grad_z = grad_y * sig * (1 - sig)
    
    # ∂z/∂w = x, ∂z/∂b = 1, ∂z/∂x = w
    grad_w = grad_z * x
    grad_b = grad_z * 1
    grad_x = grad_z * w  # For passing to previous layer
    
    return grad_w, grad_b, grad_x
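
As a quick sanity check, here is a minimal sketch (the test values for x, w, and b are arbitrary) that wires the two functions together and verifies the analytic gradient against a central difference:

x, w, b = 1.5, -0.8, 0.2
y, z = neuron_forward(x, w, b)

# Take L = y directly, so ∂L/∂y = 1
grad_w, grad_b, grad_x = neuron_backward(x, z, grad_y=1.0)

# Central-difference check for ∂y/∂w
eps = 1e-6
y_plus, _ = neuron_forward(x, w + eps, b)
y_minus, _ = neuron_forward(x, w - eps, b)
print(f"analytic  dy/dw = {grad_w:.8f}")
print(f"numerical dy/dw = {(y_plus - y_minus) / (2 * eps):.8f}")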

Full Backpropagation Algorithm

For a multi-layer network, backpropagation works as follows:

Algorithm

1. FORWARD PASS:
   For each layer l = 1, 2, ..., L:
       z[l] = W[l] @ a[l-1] + b[l]    # Linear
       a[l] = activation(z[l])         # Non-linear

2. COMPUTE LOSS:
   L = loss_function(a[L], y_true)

3. BACKWARD PASS:
   Compute ∂L/∂a[L] (depends on loss function)
   
   For each layer l = L, L-1, ..., 1:
       ∂L/∂z[l] = ∂L/∂a[l] * activation'(z[l])  # Element-wise
       ∂L/∂W[l] = ∂L/∂z[l] @ a[l-1].T
       ∂L/∂b[l] = sum(∂L/∂z[l])
       ∂L/∂a[l-1] = W[l].T @ ∂L/∂z[l]          # Pass to prev layer

4. UPDATE PARAMETERS:
   W[l] = W[l] - α * ∂L/∂W[l]
   b[l] = b[l] - α * ∂L/∂b[l]

Implementation from Scratch

class BackpropNetwork:
    """Neural network with manual backpropagation."""
    
    def __init__(self, layer_sizes):
        self.n_layers = len(layer_sizes) - 1
        self.weights = []
        self.biases = []
        
        for i in range(self.n_layers):
            W = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * 0.1
            b = np.zeros((1, layer_sizes[i+1]))
            self.weights.append(W)
            self.biases.append(b)
    
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
    
    def sigmoid_derivative(self, z):
        s = self.sigmoid(z)
        return s * (1 - s)
    
    def forward(self, X):
        """Forward pass, caching values for backprop."""
        self.a = [X]  # Activations
        self.z = []   # Pre-activations
        
        current = X
        for i in range(self.n_layers):
            z = current @ self.weights[i] + self.biases[i]
            self.z.append(z)
            current = self.sigmoid(z)
            self.a.append(current)
        
        return current
    
    def backward(self, y):
        """Backward pass, computing gradients."""
        m = y.shape[0]
        self.grad_W = []
        self.grad_b = []
        
        # Output layer gradient for MSE loss: L = mean((a[L] - y)^2)
        # ∂L/∂a[L] = 2(a[L] - y) per sample; the 1/m from the mean
        # is applied when grad_W and grad_b are averaged below
        grad_a = 2 * (self.a[-1] - y.reshape(-1, 1))
        
        for i in range(self.n_layers - 1, -1, -1):
            # ∂L/∂z = ∂L/∂a * σ'(z)
            grad_z = grad_a * self.sigmoid_derivative(self.z[i])
            
            # ∂L/∂W = a[l-1].T @ ∂L/∂z
            grad_W = self.a[i].T @ grad_z / m
            
            # ∂L/∂b = sum(∂L/∂z)
            grad_b = np.mean(grad_z, axis=0, keepdims=True)
            
            # Store gradients (in reverse order)
            self.grad_W.insert(0, grad_W)
            self.grad_b.insert(0, grad_b)
            
            # ∂L/∂a[l-1] = ∂L/∂z @ W.T
            if i > 0:
                grad_a = grad_z @ self.weights[i].T
    
    def update(self, learning_rate):
        """Update parameters using computed gradients."""
        for i in range(self.n_layers):
            self.weights[i] -= learning_rate * self.grad_W[i]
            self.biases[i] -= learning_rate * self.grad_b[i]
    
    def train(self, X, y, epochs=1000, learning_rate=0.1):
        """Training loop."""
        losses = []
        
        for epoch in range(epochs):
            # Forward
            output = self.forward(X)
            
            # Loss (MSE)
            loss = np.mean((output - y.reshape(-1, 1))**2)
            losses.append(loss)
            
            # Backward
            self.backward(y)
            
            # Update
            self.update(learning_rate)
            
            if epoch % 100 == 0:
                print(f"Epoch {epoch}: Loss = {loss:.6f}")
        
        return losses


# Test on XOR
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

net = BackpropNetwork([2, 4, 1])
losses = net.train(X_xor, y_xor, epochs=3000, learning_rate=1.0)

print("\nResults:")
predictions = net.forward(X_xor)
for x, y_true, y_pred in zip(X_xor, y_xor, predictions):
    print(f"  {x} -> {y_pred[0]:.4f} (true: {y_true})")

Gradient Checking

How do you know your gradients are correct? Numerical gradient checking:

$$\frac{\partial f}{\partial \theta} \approx \frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}$$
def gradient_check(net, X, y, epsilon=1e-7):
    """Check gradients numerically."""
    # Get analytical gradients
    net.forward(X)
    net.backward(y)
    
    # Check each weight matrix
    for layer_idx in range(len(net.weights)):
        W = net.weights[layer_idx]
        grad_analytical = net.grad_W[layer_idx]
        
        grad_numerical = np.zeros_like(W)
        
        # Check each weight
        for i in range(W.shape[0]):
            for j in range(W.shape[1]):
                # Compute f(θ + ε)
                W[i, j] += epsilon
                output_plus = net.forward(X)
                loss_plus = np.mean((output_plus - y.reshape(-1, 1))**2)
                
                # Compute f(θ - ε)
                W[i, j] -= 2 * epsilon
                output_minus = net.forward(X)
                loss_minus = np.mean((output_minus - y.reshape(-1, 1))**2)
                
                # Restore weight
                W[i, j] += epsilon
                
                # Numerical gradient
                grad_numerical[i, j] = (loss_plus - loss_minus) / (2 * epsilon)
        
        # Compare
        difference = np.linalg.norm(grad_analytical - grad_numerical)
        norm_sum = np.linalg.norm(grad_analytical) + np.linalg.norm(grad_numerical)
        relative_error = difference / (norm_sum + 1e-8)
        
        print(f"Layer {layer_idx}: Relative error = {relative_error:.2e}")
        
        if relative_error > 1e-5:
            print("  WARNING: Gradient might be incorrect!")

# Run gradient check
gradient_check(net, X_xor, y_xor)

Visualizing Gradient Flow

[Figure: Gradient flow through the network]

The Vanishing Gradient Problem

With sigmoid activation:
  • Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z)) \leq 0.25$
  • After 10 layers: $0.25^{10} \approx 0.000001$
Gradients shrink exponentially as they flow backward!
# Demonstration of vanishing gradients
def visualize_gradient_flow():
    """Show how gradients vanish in deep sigmoid networks."""
    import matplotlib.pyplot as plt
    
    # Simulate gradient flow through layers
    n_layers = 20
    gradient = 1.0
    gradients = [gradient]
    
    for _ in range(n_layers):
        # The sigmoid derivative peaks at 0.25 (at z = 0) and approaches
        # 0 in saturation; 0.2 is a generous per-layer estimate
        gradient *= 0.2
        gradients.append(gradient)
    
    plt.figure(figsize=(10, 5))
    plt.semilogy(gradients, 'b-o', linewidth=2)
    plt.xlabel('Layer (backward from output)')
    plt.ylabel('Gradient Magnitude (log scale)')
    plt.title('Vanishing Gradients in Deep Sigmoid Networks')
    plt.grid(True)
    plt.show()
    
    print(f"Gradient after 20 layers: {gradients[-1]:.2e}")

visualize_gradient_flow()
Solutions:
  1. ReLU activation: Gradient is 1 for positive inputs
  2. Batch normalization: Keeps activations in a well-scaled range
  3. Residual connections: Skip connections give gradients a direct path backward
  4. Better initialization: He/Xavier initialization (a sketch combining these ideas follows)
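
To make two of these fixes concrete, here is a minimal PyTorch sketch (the class name ResidualBlock and the dimensions are illustrative choices, not part of this lesson’s code) combining ReLU activations, He initialization, and a skip connection:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two linear layers with a skip connection: out = relu(f(x) + x)."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        # He initialization keeps gradient variance stable under ReLU
        nn.init.kaiming_normal_(self.fc1.weight, nonlinearity='relu')
        nn.init.kaiming_normal_(self.fc2.weight, nonlinearity='relu')

    def forward(self, x):
        out = torch.relu(self.fc1(x))
        out = self.fc2(out)
        return torch.relu(out + x)  # The +x term gives gradients a direct path

block = ResidualBlock(8)
x = torch.randn(4, 8, requires_grad=True)
block(x).sum().backward()
print(f"Gradient norm at the input: {x.grad.norm():.4f}")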

PyTorch Autograd

PyTorch handles all of this automatically:
import torch
import torch.nn as nn

# Define a simple network
model = nn.Sequential(
    nn.Linear(2, 4),
    nn.Sigmoid(),
    nn.Linear(4, 1),
    nn.Sigmoid()
)

# Input (requires_grad=False for inputs)
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

# Forward pass
output = model(X)
loss = nn.MSELoss()(output, y)

print(f"Loss: {loss.item():.4f}")

# Backward pass - AUTOMATIC!
loss.backward()

# Gradients are computed and stored
for name, param in model.named_parameters():
    print(f"{name}: grad shape = {param.grad.shape}")
    print(f"  grad sample: {param.grad.flatten()[:3]}")

How Autograd Works

PyTorch builds a computational graph during the forward pass:
# Create tensor with gradient tracking
x = torch.tensor([2.0], requires_grad=True)
w = torch.tensor([3.0], requires_grad=True)

# Operations build the graph
y = x * w + 1
z = y ** 2

# Backpropagate
z.backward()

print(f"dz/dx = {x.grad}")  # d/dx [(xw+1)²] = 2(xw+1)w = 2(7)(3) = 42
print(f"dz/dw = {w.grad}")  # d/dw [(xw+1)²] = 2(xw+1)x = 2(7)(2) = 28

Visualizing the Computation Graph

# Using torchviz to visualize
try:
    from torchviz import make_dot
    
    x = torch.tensor([2.0], requires_grad=True)
    w = torch.tensor([3.0], requires_grad=True)
    y = x * w + 1
    z = y ** 2
    
    dot = make_dot(z, params={"x": x, "w": w})
    dot.render("computation_graph", format="png", cleanup=True)
    print("Graph saved to computation_graph.png")
except ImportError:
    print("Install torchviz: pip install torchviz graphviz")

Common Backprop Patterns

Pattern 1: Add Gate

$$f = x + y \implies \frac{\partial f}{\partial x} = 1, \quad \frac{\partial f}{\partial y} = 1$$

Gradient distributor: Passes the gradient unchanged to both inputs.

Pattern 2: Multiply Gate

$$f = x \cdot y \implies \frac{\partial f}{\partial x} = y, \quad \frac{\partial f}{\partial y} = x$$

Gradient switcher: The gradient to $x$ is scaled by $y$, and vice versa.

Pattern 3: Max Gate

$$f = \max(x, y) \implies \frac{\partial f}{\partial x} = \mathbf{1}_{x>y}, \quad \frac{\partial f}{\partial y} = \mathbf{1}_{y>x}$$

Gradient router: The gradient flows only to the larger input.

Pattern 4: ReLU

$$f = \max(0, x) \implies \frac{\partial f}{\partial x} = \mathbf{1}_{x>0}$$

Gradient gate: Passes the gradient if the input was positive, blocks it otherwise.
[Figure: Backpropagation patterns]
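
All four patterns are easy to confirm with autograd; a minimal sketch (the input values 2.0 and 3.0 are arbitrary):

import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

# Add gate distributes: both inputs receive gradient 1
(x + y).backward()
print(f"add:  df/dx = {x.grad}, df/dy = {y.grad}")

# Multiply gate switches: x receives y's value and vice versa
x.grad = y.grad = None
(x * y).backward()
print(f"mul:  df/dx = {x.grad}, df/dy = {y.grad}")

# Max gate routes: only the larger input receives gradient
x.grad = y.grad = None
torch.maximum(x, y).backward()
print(f"max:  df/dx = {x.grad}, df/dy = {y.grad}")

# ReLU gate: gradient passes because x > 0
x.grad = None
torch.relu(x).backward()
print(f"relu: df/dx = {x.grad}")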

Exercises

1. Compute the gradients by hand for this simple network on the input $x = 1$:

$$f = \sigma(\sigma(xw_1 + b_1)w_2 + b_2)$$

with $w_1 = 0.5$, $b_1 = 0.1$, $w_2 = -0.3$, $b_2 = 0.2$. Verify your answer with PyTorch.
2. Implement a custom activation function with PyTorch autograd:
class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)
    
    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input
Test that it gives the same results as nn.ReLU().
3. Create a 50-layer network with each of the following activations:
  • Sigmoid
  • ReLU
  • Tanh
Measure the gradient magnitude at each layer. Plot and compare.
4. Implement the backward pass for batch normalization. Forward:

$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$$

The backward pass involves computing $\frac{\partial L}{\partial x}$, $\frac{\partial L}{\partial \gamma}$, and $\frac{\partial L}{\partial \beta}$.

Key Takeaways

Concept               Key Insight
Chain Rule            Multiply gradients along paths
Computational Graph   Break complex functions into simple operations
Forward Pass          Compute outputs, cache intermediates
Backward Pass         Compute gradients from output to input
Gradient Checking     Verify with numerical gradients
Vanishing Gradients   Sigmoid/tanh shrink gradients → use ReLU
