Backpropagation Deep Dive
The Central Question
The Core Insight: Credit Assignment
Computational Graphs
A Simple Example
Backward Pass with Chain Rule
The Chain Rule: The Key to Everything
Intuition
Backpropagation in a Neural Network
Single Neuron
Full Backpropagation Algorithm
Algorithm
Implementation from Scratch
Gradient Checking
Visualizing Gradient Flow
The Vanishing Gradient Problem
PyTorch Autograd
How Autograd Works
Visualizing the Computation Graph
Common Backprop Patterns
Pattern 1: Add Gate
Pattern 2: Multiply Gate
Pattern 3: Max Gate
Pattern 4: ReLU
Exercises
Key Takeaways
What’s Next

Backpropagation Deep Dive

The Central Question

You have a neural network with millions of parameters. The network makes a prediction. The prediction is wrong. How do you know which of those millions of parameters to adjust, and by how much? This is the problem backpropagation solves. It’s the algorithm that makes deep learning possible.

Historical Context: Backpropagation was independently discovered multiple times, but Rumelhart, Hinton, and Williams’ 1986 paper popularized it. It remained largely unused until 2012 because:

Computers weren’t fast enough
Data wasn’t available
Techniques to train deep networks weren’t developed

Today, it runs trillions of times per second in data centers worldwide.

The Core Insight: Credit Assignment

Imagine a factory assembly line:

Raw Materials → [Worker 1] → [Worker 2] → [Worker 3] → Defective Product

The final product is defective. Who’s at fault? You need to trace backward through the assembly line to find where things went wrong. That’s exactly what backpropagation does.

Computational Graphs

To understand backpropagation, we first need to see neural networks as computational graphs.

A Simple Example

Let’s compute

f(x, y, z) = (x + y) \cdot z

       [+]
      /   \
     x     y
       \   /
        [*]──── z
          \
           f

import numpy as np

# Forward pass
def forward(x, y, z):
    a = x + y      # Intermediate result
    f = a * z      # Final result
    return f, a    # Return intermediate for backprop

x, y, z = 2, 3, 4
f, a = forward(x, y, z)
print(f"f = ({x} + {y}) × {z} = {a} × {z} = {f}")

Backward Pass with Chain Rule

To compute

\frac{\partial f}{\partial x}

, we apply the chain rule:

\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \cdot \frac{\partial a}{\partial x}

Let’s trace through:

$f = a \cdot z$ , so $\frac{\partial f}{\partial a} = z = 4$
$a = x + y$ , so $\frac{\partial a}{\partial x} = 1$
Therefore: $\frac{\partial f}{\partial x} = 4 \cdot 1 = 4$

# Backward pass
def backward(z, a, grad_f=1.0):
    """
    grad_f: gradient of loss w.r.t. f (usually 1.0 if f is the loss)
    """
    # df/da = z (from f = a * z)
    grad_a = grad_f * z
    
    # df/dz = a (from f = a * z)
    grad_z = grad_f * a
    
    # da/dx = 1 (from a = x + y)
    grad_x = grad_a * 1
    
    # da/dy = 1 (from a = x + y)
    grad_y = grad_a * 1
    
    return grad_x, grad_y, grad_z

grad_x, grad_y, grad_z = backward(z, a)
print(f"∂f/∂x = {grad_x}, ∂f/∂y = {grad_y}, ∂f/∂z = {grad_z}")

The Chain Rule: The Key to Everything

The chain rule states:

\frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x}

Or for a chain of functions

f(g(h(x)))

\frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial x}

This is the ONLY math you need for backpropagation!

Intuition

g

doubles when

x

increases by 1, and

f

triples when

g

doubles:

Then $f$ should increase by $3 \times 2 = 6$ when $x$ increases by 1

The chain rule is about multiplying sensitivities along a path.

Backpropagation in a Neural Network

Now let’s apply this to an actual neural network layer.

Single Neuron

A single neuron computes:

y = \sigma(wx + b)

where

\sigma

is the activation function (e.g., sigmoid). Forward pass:

def neuron_forward(x, w, b):
    z = w * x + b          # Linear combination
    y = sigmoid(z)          # Activation
    return y, z             # Cache z for backward pass

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

Backward pass: We need

\frac{\partial L}{\partial w}

and

\frac{\partial L}{\partial b}

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial w}

def neuron_backward(x, z, grad_y):
    """
    grad_y: gradient of loss w.r.t. output y (∂L/∂y)
    """
    # ∂y/∂z: sigmoid derivative
    sig = sigmoid(z)
    grad_z = grad_y * sig * (1 - sig)
    
    # ∂z/∂w = x, ∂z/∂b = 1, ∂z/∂x = w
    grad_w = grad_z * x
    grad_b = grad_z * 1
    grad_x = grad_z * w  # For passing to previous layer
    
    return grad_w, grad_b, grad_x

Full Backpropagation Algorithm

For a multi-layer network, backpropagation works as follows:

Algorithm

1. FORWARD PASS:
   For each layer l = 1, 2, ..., L:
       z[l] = W[l] @ a[l-1] + b[l]    # Linear
       a[l] = activation(z[l])         # Non-linear

2. COMPUTE LOSS:
   L = loss_function(a[L], y_true)

3. BACKWARD PASS:
   Compute ∂L/∂a[L] (depends on loss function)
   
   For each layer l = L, L-1, ..., 1:
       ∂L/∂z[l] = ∂L/∂a[l] * activation'(z[l])  # Element-wise
       ∂L/∂W[l] = ∂L/∂z[l] @ a[l-1].T
       ∂L/∂b[l] = sum(∂L/∂z[l])
       ∂L/∂a[l-1] = W[l].T @ ∂L/∂z[l]          # Pass to prev layer

4. UPDATE PARAMETERS:
   W[l] = W[l] - α * ∂L/∂W[l]
   b[l] = b[l] - α * ∂L/∂b[l]

Implementation from Scratch

class BackpropNetwork:
    """Neural network with manual backpropagation."""
    
    def __init__(self, layer_sizes):
        self.n_layers = len(layer_sizes) - 1
        self.weights = []
        self.biases = []
        
        for i in range(self.n_layers):
            W = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * 0.1
            b = np.zeros((1, layer_sizes[i+1]))
            self.weights.append(W)
            self.biases.append(b)
    
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
    
    def sigmoid_derivative(self, z):
        s = self.sigmoid(z)
        return s * (1 - s)
    
    def forward(self, X):
        """Forward pass, caching values for backprop."""
        self.a = [X]  # Activations
        self.z = []   # Pre-activations
        
        current = X
        for i in range(self.n_layers):
            z = current @ self.weights[i] + self.biases[i]
            self.z.append(z)
            current = self.sigmoid(z)
            self.a.append(current)
        
        return current
    
    def backward(self, y):
        """Backward pass, computing gradients."""
        m = y.shape[0]
        self.grad_W = []
        self.grad_b = []
        
        # Output layer gradient (for MSE loss + sigmoid)
        # ∂L/∂a[L] = (a[L] - y) for MSE
        grad_a = self.a[-1] - y.reshape(-1, 1)
        
        for i in range(self.n_layers - 1, -1, -1):
            # ∂L/∂z = ∂L/∂a * σ'(z)
            grad_z = grad_a * self.sigmoid_derivative(self.z[i])
            
            # ∂L/∂W = a[l-1].T @ ∂L/∂z
            grad_W = self.a[i].T @ grad_z / m
            
            # ∂L/∂b = sum(∂L/∂z)
            grad_b = np.mean(grad_z, axis=0, keepdims=True)
            
            # Store gradients (in reverse order)
            self.grad_W.insert(0, grad_W)
            self.grad_b.insert(0, grad_b)
            
            # ∂L/∂a[l-1] = ∂L/∂z @ W.T
            if i > 0:
                grad_a = grad_z @ self.weights[i].T
    
    def update(self, learning_rate):
        """Update parameters using computed gradients."""
        for i in range(self.n_layers):
            self.weights[i] -= learning_rate * self.grad_W[i]
            self.biases[i] -= learning_rate * self.grad_b[i]
    
    def train(self, X, y, epochs=1000, learning_rate=0.1):
        """Training loop."""
        losses = []
        
        for epoch in range(epochs):
            # Forward
            output = self.forward(X)
            
            # Loss (MSE)
            loss = np.mean((output - y.reshape(-1, 1))**2)
            losses.append(loss)
            
            # Backward
            self.backward(y)
            
            # Update
            self.update(learning_rate)
            
            if epoch % 100 == 0:
                print(f"Epoch {epoch}: Loss = {loss:.6f}")
        
        return losses


# Test on XOR
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

net = BackpropNetwork([2, 4, 1])
losses = net.train(X_xor, y_xor, epochs=3000, learning_rate=1.0)

print("\nResults:")
predictions = net.forward(X_xor)
for x, y_true, y_pred in zip(X_xor, y_xor, predictions):
    print(f"  {x} -> {y_pred[0]:.4f} (true: {y_true})")

Gradient Checking

How do you know your gradients are correct? Numerical gradient checking:

\frac{\partial f}{\partial \theta} \approx \frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}

def gradient_check(net, X, y, epsilon=1e-7):
    """Check gradients numerically."""
    # Get analytical gradients
    net.forward(X)
    net.backward(y)
    
    # Check each weight matrix
    for layer_idx in range(len(net.weights)):
        W = net.weights[layer_idx]
        grad_analytical = net.grad_W[layer_idx]
        
        grad_numerical = np.zeros_like(W)
        
        # Check each weight
        for i in range(W.shape[0]):
            for j in range(W.shape[1]):
                # Compute f(θ + ε)
                W[i, j] += epsilon
                output_plus = net.forward(X)
                loss_plus = np.mean((output_plus - y.reshape(-1, 1))**2)
                
                # Compute f(θ - ε)
                W[i, j] -= 2 * epsilon
                output_minus = net.forward(X)
                loss_minus = np.mean((output_minus - y.reshape(-1, 1))**2)
                
                # Restore weight
                W[i, j] += epsilon
                
                # Numerical gradient
                grad_numerical[i, j] = (loss_plus - loss_minus) / (2 * epsilon)
        
        # Compare
        difference = np.linalg.norm(grad_analytical - grad_numerical)
        norm_sum = np.linalg.norm(grad_analytical) + np.linalg.norm(grad_numerical)
        relative_error = difference / (norm_sum + 1e-8)
        
        print(f"Layer {layer_idx}: Relative error = {relative_error:.2e}")
        
        if relative_error > 1e-5:
            print("  WARNING: Gradient might be incorrect!")

# Run gradient check
gradient_check(net, X_xor, y_xor)

Visualizing Gradient Flow

The Vanishing Gradient Problem

With sigmoid activation:

Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z)) \leq 0.25$
After 10 layers: $0.25^{10} \approx 0.000001$

Gradients shrink exponentially as they flow backward!

# Demonstration of vanishing gradients
def visualize_gradient_flow():
    """Show how gradients vanish in deep sigmoid networks."""
    import matplotlib.pyplot as plt
    
    # Simulate gradient flow through layers
    n_layers = 20
    gradient = 1.0
    gradients = [gradient]
    
    for _ in range(n_layers):
        # Sigmoid derivative at saturation is ~0.25, typical is 0.1-0.2
        gradient *= 0.2  # Typical sigmoid derivative
        gradients.append(gradient)
    
    plt.figure(figsize=(10, 5))
    plt.semilogy(gradients, 'b-o', linewidth=2)
    plt.xlabel('Layer (backward from output)')
    plt.ylabel('Gradient Magnitude (log scale)')
    plt.title('Vanishing Gradients in Deep Sigmoid Networks')
    plt.grid(True)
    plt.show()
    
    print(f"Gradient after 20 layers: {gradients[-1]:.2e}")

visualize_gradient_flow()

Solutions:

ReLU activation: Gradient is 1 for positive inputs
Batch normalization: Keeps activations in a good range
Residual connections: Skip connections add gradients
Better initialization: He/Xavier initialization

PyTorch Autograd

PyTorch handles all of this automatically:

import torch
import torch.nn as nn

# Define a simple network
model = nn.Sequential(
    nn.Linear(2, 4),
    nn.Sigmoid(),
    nn.Linear(4, 1),
    nn.Sigmoid()
)

# Input (requires_grad=False for inputs)
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

# Forward pass
output = model(X)
loss = nn.MSELoss()(output, y)

print(f"Loss: {loss.item():.4f}")

# Backward pass - AUTOMATIC!
loss.backward()

# Gradients are computed and stored
for name, param in model.named_parameters():
    print(f"{name}: grad shape = {param.grad.shape}")
    print(f"  grad sample: {param.grad.flatten()[:3]}")

How Autograd Works

PyTorch builds a computational graph during the forward pass:

# Create tensor with gradient tracking
x = torch.tensor([2.0], requires_grad=True)
w = torch.tensor([3.0], requires_grad=True)

# Operations build the graph
y = x * w + 1
z = y ** 2

# Backpropagate
z.backward()

print(f"dz/dx = {x.grad}")  # d/dx [(xw+1)²] = 2(xw+1)w = 2(7)(3) = 42
print(f"dz/dw = {w.grad}")  # d/dw [(xw+1)²] = 2(xw+1)x = 2(7)(2) = 28

Visualizing the Computation Graph

# Using torchviz to visualize
try:
    from torchviz import make_dot
    
    x = torch.tensor([2.0], requires_grad=True)
    w = torch.tensor([3.0], requires_grad=True)
    y = x * w + 1
    z = y ** 2
    
    dot = make_dot(z, params={"x": x, "w": w})
    dot.render("computation_graph", format="png", cleanup=True)
    print("Graph saved to computation_graph.png")
except ImportError:
    print("Install torchviz: pip install torchviz graphviz")

Common Backprop Patterns

Pattern 1: Add Gate

f = x + y \implies \frac{\partial f}{\partial x} = 1, \quad \frac{\partial f}{\partial y} = 1

Gradient distributor: Passes gradient unchanged to both inputs.

Pattern 2: Multiply Gate

f = x \cdot y \implies \frac{\partial f}{\partial x} = y, \quad \frac{\partial f}{\partial y} = x

Gradient switcher: Gradient to

x

is scaled by

y

and vice versa.

Pattern 3: Max Gate

f = \max(x, y) \implies \frac{\partial f}{\partial x} = \mathbf{1}_{x>y}, \quad \frac{\partial f}{\partial y} = \mathbf{1}_{y>x}

Gradient router: Gradient flows only to the larger input.

Pattern 4: ReLU

f = \max(0, x) \implies \frac{\partial f}{\partial x} = \mathbf{1}_{x>0}

Gradient gate: Passes gradient if input was positive, blocks if negative.

Exercises

Exercise 1: Manual Backprop

Compute the gradients by hand for this simple network on the input

x=1

f = \sigma(\sigma(xw_1 + b_1)w_2 + b_2)

With

w_1 = 0.5

b_1 = 0.1

w_2 = -0.3

b_2 = 0.2

.Verify your answer with PyTorch.

Exercise 2: Custom Autograd Function

Implement a custom activation function with PyTorch autograd:

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)
    
    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

Test that it gives the same results as nn.ReLU().

Exercise 3: Gradient Flow Analysis

Create a 50-layer network with:

Sigmoid activations
ReLU activations
Tanh activations

Measure the gradient magnitude at each layer. Plot and compare.

Exercise 4: Implement Batch Norm Backward

Implement the backward pass for batch normalization:Forward:

\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}

The backward pass involves computing

\frac{\partial L}{\partial x}

\frac{\partial L}{\partial \gamma}

\frac{\partial L}{\partial \beta}

Key Takeaways

Concept	Key Insight
Chain Rule	Multiply gradients along paths
Computational Graph	Break complex functions into simple operations
Forward Pass	Compute outputs, cache intermediates
Backward Pass	Compute gradients from output to input
Gradient Checking	Verify with numerical gradients
Vanishing Gradients	Sigmoid/tanh → use ReLU

What’s Next

Module 4: Activation Functions

Deep dive into ReLU, GELU, Swish, and when to use which activation.

Perceptrons & MLPs Activation Functions

Overview

Testing & Code Quality

Crash Courses

AI Engineering

Math for ML - Understanding Linear Algebra

Probability & Statistics for ML

Math for ML - Understanding Calculus

ML Mastery

Deep Learning Mastery

NestJS Mastery

Microservices Mastery

Low Level Design

OOP Concepts

SOLID Principles

Design Patterns

LLD Case Studies

System Design (HLD)

Senior Level (L5+/Staff)

HLD Case Studies

Engineering Fundamentals

DevOps & Operations

Azure Cloud Engineering

AWS Cloud

AWS Monitoring & Observability

AWS Security Services

AWS Serverless

AWS Operations

AWS Advanced

AWS Case Studies

GCP Cloud Engineering

DevOps Tools

Database Engineering

HIPAA Compliance Mastery

Operating Systems

Linux Internals

Distributed Systems

Networking Mastery

Build Your Own X

Go Lang Mastery

C Programming

Classic Research Papers

Distributed System Tools

​Backpropagation Deep Dive

​The Central Question

​The Core Insight: Credit Assignment

​Computational Graphs

​A Simple Example

​Backward Pass with Chain Rule

​The Chain Rule: The Key to Everything

​Intuition

​Backpropagation in a Neural Network

​Single Neuron

​Full Backpropagation Algorithm

​Algorithm

​Implementation from Scratch

​Gradient Checking

​Visualizing Gradient Flow

​The Vanishing Gradient Problem

​PyTorch Autograd

​How Autograd Works

​Visualizing the Computation Graph

​Common Backprop Patterns

​Pattern 1: Add Gate

​Pattern 2: Multiply Gate

​Pattern 3: Max Gate

​Pattern 4: ReLU

​Exercises

​Key Takeaways

​What’s Next