Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Automatic Differentiation

Automatic Differentiation

The Problem with Manual Gradients

In the previous modules, you computed gradients by hand. For simple functions like f(x)=x2f(x) = x^2, that’s easy. But what about this? f(x)=σ(iwiReLU(jvijxj+bi))f(x) = \sigma\left(\sum_i w_i \cdot \text{ReLU}\left(\sum_j v_{ij} \cdot x_j + b_i\right)\right) That’s a 2-layer neural network. Now imagine 100 layers. Nobody computes these gradients by hand. Automatic differentiation (autodiff) does it for you - and that’s what powers PyTorch and TensorFlow.
Estimated Time: 3-4 hours
Difficulty: Intermediate
Prerequisites: Chain Rule module
What You’ll Build: Your own mini autodiff system!

Three Ways to Compute Gradients

Think of computing derivatives as three different approaches to getting driving directions:
  • Symbolic differentiation is like working out the route with pen and paper, applying road rules algebraically. You get a perfect formula, but it gets impossibly complex for a cross-country trip with thousands of turns.
  • Numerical differentiation is like driving the route twice with slightly different starting positions and comparing where you end up. Simple but slow (you literally run the function twice per parameter), and small measurement errors accumulate.
  • Automatic differentiation is like having a co-pilot who records every turn you make and can replay the route backward, noting exactly which turns contributed to going north vs. south. One forward trip plus one backward replay, and you know the derivative with respect to every single starting condition.
MethodHow It WorksProsCons
SymbolicAlgebraic manipulationExact formulasExplodes for complex functions
NumericalFinite differences: f(x+h)f(x)h\frac{f(x+h) - f(x)}{h}SimpleSlow (O(n)O(n) evaluations per parameter), imprecise
AutomaticTracks operations, applies chain ruleFast (same cost as forward pass), exactNeeds special implementation
import numpy as np

# Example function: f(x) = sin(x^2)
def f(x):
    return np.sin(x**2)

# Method 1: Symbolic (by hand)
def df_symbolic(x):
    # f'(x) = cos(x^2) * 2x
    return np.cos(x**2) * 2 * x

# Method 2: Numerical differentiation
def df_numerical(x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

# Compare at x = 1.5
x = 1.5
print(f"Symbolic derivative: {df_symbolic(x):.8f}")
print(f"Numerical derivative: {df_numerical(x):.8f}")
print(f"Difference: {abs(df_symbolic(x) - df_numerical(x)):.2e}")
The problem with numerical differentiation:
# For a function with 1 million parameters
n_params = 1_000_000

# Numerical: requires 2 million function evaluations!
numerical_cost = 2 * n_params

# Autodiff: just one forward + one backward pass
autodiff_cost = 2

print(f"Numerical: {numerical_cost:,} evaluations")
print(f"Autodiff: {autodiff_cost} passes")
print(f"Speedup: {numerical_cost / autodiff_cost:,}x")

The Core Idea: Computational Graphs

Every computation can be broken into a graph of simple operations.

Example: f(x1,x2)=(x1+x2)(x1x2)f(x_1, x_2) = (x_1 + x_2) \cdot (x_1 \cdot x_2)

# Forward pass: compute f(2, 3)
x1, x2 = 2, 3

# Step-by-step with intermediate variables
a = x1 + x2      # a = 5
b = x1 * x2      # b = 6
f = a * b        # f = 30

print(f"f({x1}, {x2}) = {f}")

# The computation graph:
#
#   x1 ─┬─→ [+] ──→ a ──┐
#       │               │
#       └─→ [*] ──→ b ──┼─→ [*] ──→ f
#       │               │
#   x2 ─┴───────────────┘

Backward Pass: Chain Rule Through the Graph

# To find ∂f/∂x1, trace paths from f back to x1

# Path 1: f → a → x1
# ∂f/∂a = b = 6
# ∂a/∂x1 = 1
# Contribution: 6 * 1 = 6

# Path 2: f → b → x1
# ∂f/∂b = a = 5
# ∂b/∂x1 = x2 = 3
# Contribution: 5 * 3 = 15

# Total: ∂f/∂x1 = 6 + 15 = 21

# Similarly for x2:
# Path 1: f → a → x2: 6 * 1 = 6
# Path 2: f → b → x2: 5 * 2 = 10
# Total: ∂f/∂x2 = 16

# Verify numerically
def f_func(x1, x2):
    return (x1 + x2) * (x1 * x2)

h = 1e-7
df_dx1 = (f_func(2 + h, 3) - f_func(2 - h, 3)) / (2 * h)
df_dx2 = (f_func(2, 3 + h) - f_func(2, 3 - h)) / (2 * h)

print(f"∂f/∂x1 = {df_dx1:.6f} (should be 21)")
print(f"∂f/∂x2 = {df_dx2:.6f} (should be 16)")

Build Your Own Autodiff System

Let’s implement a simple autodiff system from scratch!
class Value:
    """
    A differentiable value that tracks its gradient.
    Inspired by Andrej Karpathy's micrograd.
    """
    
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0  # Gradient of final output w.r.t. this value
        self._backward = lambda: None  # Function to compute local gradients
        self._prev = set(_children)
        self._op = _op
    
    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"
    
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        
        def _backward():
            # d(a+b)/da = 1, d(a+b)/db = 1
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        
        out._backward = _backward
        return out
    
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        
        def _backward():
            # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        
        out._backward = _backward
        return out
    
    def __pow__(self, n):
        out = Value(self.data ** n, (self,), f'**{n}')
        
        def _backward():
            # d(x^n)/dx = n * x^(n-1)
            self.grad += n * (self.data ** (n-1)) * out.grad
        
        out._backward = _backward
        return out
    
    def relu(self):
        out = Value(max(0, self.data), (self,), 'ReLU')
        
        def _backward():
            # d(relu)/dx = 1 if x > 0 else 0
            self.grad += (self.data > 0) * out.grad
        
        out._backward = _backward
        return out
    
    def backward(self):
        """Compute gradients using reverse-mode autodiff."""
        # Topological sort
        topo = []
        visited = set()
        
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        
        build_topo(self)
        
        # Set output gradient to 1
        self.grad = 1.0
        
        # Backpropagate through the graph
        for v in reversed(topo):
            v._backward()
    
    # Enable reverse operations (e.g., 2 * x)
    def __radd__(self, other):
        return self + other
    
    def __rmul__(self, other):
        return self * other
    
    def __neg__(self):
        return self * -1
    
    def __sub__(self, other):
        return self + (-other)

# Test our autodiff!
print("Testing our autodiff system:")
print("=" * 40)

# Example 1: Simple expression
x = Value(2.0)
y = Value(3.0)
z = x * y + x ** 2

print(f"z = x*y + x^2 at x=2, y=3")
print(f"z = {z.data}")

z.backward()
print(f"∂z/∂x = {x.grad} (should be y + 2x = 7)")
print(f"∂z/∂y = {y.grad} (should be x = 2)")

# Example 2: Neural network layer!
print("\n" + "=" * 40)
print("Neural network forward/backward pass:")

# Input
x1 = Value(1.0)
x2 = Value(2.0)

# Weights
w1 = Value(0.5)
w2 = Value(-0.5)
b = Value(0.1)

# Forward pass: neuron = relu(w1*x1 + w2*x2 + b)
neuron = (w1 * x1 + w2 * x2 + b).relu()

print(f"Input: x1={x1.data}, x2={x2.data}")
print(f"Weights: w1={w1.data}, w2={w2.data}, b={b.data}")
print(f"Output: {neuron.data}")

# Backward pass
neuron.backward()

print(f"\nGradients:")
print(f"  ∂neuron/∂w1 = {w1.grad}")
print(f"  ∂neuron/∂w2 = {w2.grad}")
print(f"  ∂neuron/∂b = {b.grad}")

Forward Mode vs Reverse Mode

There are two ways to do autodiff:
ModeComputesBest For
ForwardOne input’s effect on all outputsFew inputs, many outputs
ReverseAll inputs’ effect on one outputMany inputs, one output
Neural networks have millions of inputs (weights) and one output (loss) → Reverse mode wins!
# Forward mode: compute Jacobian column by column
# For f: R^n → R^m, need n passes to get full Jacobian

# Reverse mode: compute Jacobian row by row  
# For f: R^n → R^m, need m passes to get full Jacobian

# Neural network:
# - n = millions of weights
# - m = 1 (scalar loss)

print("Forward mode for neural network:")
print(f"  Passes needed: n = millions")

print("\nReverse mode for neural network:")
print(f"  Passes needed: m = 1")
print(f"  That's why backprop works!")

PyTorch: Autodiff in Action

import torch

# PyTorch uses reverse-mode autodiff
x = torch.tensor([2.0], requires_grad=True)
y = torch.tensor([3.0], requires_grad=True)

# Build computation graph
z = x * y + x ** 2

print(f"z = {z.item()}")

# Backward pass (computes gradients)
z.backward()

print(f"∂z/∂x = {x.grad.item()} (should be 7)")
print(f"∂z/∂y = {y.grad.item()} (should be 2)")

# Neural network example
print("\n" + "=" * 40)
print("PyTorch Neural Network Gradients:")

# Simple 2-layer network
torch.manual_seed(42)

# Input
X = torch.randn(1, 10)

# Layers (with gradients tracked)
W1 = torch.randn(10, 5, requires_grad=True)
W2 = torch.randn(5, 1, requires_grad=True)

# Forward pass
hidden = torch.relu(X @ W1)
output = hidden @ W2
loss = (output - 1.0) ** 2  # Squared error

print(f"Loss: {loss.item():.4f}")

# Backward pass (computes all gradients!)
loss.backward()

print(f"W1 gradient shape: {W1.grad.shape}")
print(f"W2 gradient shape: {W2.grad.shape}")
print(f"W1 gradient norm: {W1.grad.norm().item():.4f}")
print(f"W2 gradient norm: {W2.grad.norm().item():.4f}")

Common Pitfalls

1. Gradient Accumulation

# Gradients accumulate! Need to zero them between batches.
x = torch.tensor([2.0], requires_grad=True)

# First backward
y = x ** 2
y.backward()
print(f"After first backward: grad = {x.grad.item()}")

# Second backward WITHOUT zeroing
y = x ** 2
y.backward()
print(f"After second backward (accumulated!): grad = {x.grad.item()}")

# Correct way
x.grad.zero_()
y = x ** 2
y.backward()
print(f"After zeroing and backward: grad = {x.grad.item()}")

2. Detaching from Graph

# Sometimes you don't want gradients to flow through
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2

# Detach y from the graph
z = y.detach() * x  # Gradient won't flow through y

z.backward()
print(f"x.grad = {x.grad.item()}")  # Only from the direct x, not x^2

3. In-place Operations

# In-place operations can break autodiff
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2

# This WILL cause problems:
# y.add_(1)  # In-place addition

# Do this instead:
y = y + 1  # Creates new tensor

4. Numerical Precision in Gradient Checking

Getting Gradient Checking RightWhen verifying your custom autodiff against numerical gradients, the choice of h matters enormously. Too large and the finite difference is inaccurate; too small and floating-point cancellation ruins the result.The standard approach used by PyTorch’s gradcheck:
def gradient_check(f, x, analytical_grad, h=1e-6):
    """Verify analytical gradient against numerical gradient."""
    numerical_grad = (f(x + h) - f(x - h)) / (2 * h)
    
    # Use RELATIVE error, not absolute
    # This handles parameters of different magnitudes
    numerator = np.abs(analytical_grad - numerical_grad)
    denominator = max(np.abs(analytical_grad), np.abs(numerical_grad), 1e-8)
    relative_error = numerator / denominator
    
    if relative_error < 1e-5:
        return "PASS"
    elif relative_error < 1e-3:
        return "WARNING -- check for subtle bugs"
    else:
        return f"FAIL -- relative error {relative_error:.2e}"
Key points:
  • Always use centered differences (both +h and -h), not forward differences
  • Check relative error, not absolute error, because gradients can span many orders of magnitude
  • Temporarily disable dropout and batch normalization during gradient checking, as they introduce randomness that defeats the numerical comparison
  • If gradients pass the check with float64 but fail with float32, your code is correct but numerically sensitive — consider using Kahan summation or reorganizing computations to reduce cancellation

Visualizing Computation Graphs

def visualize_graph(root, filename='graph'):
    """Create a simple text visualization of the computation graph."""
    
    def trace(v, depth=0):
        indent = "  " * depth
        if v._op:
            print(f"{indent}├─ {v._op}{v.data:.4f} (grad: {v.grad:.4f})")
        else:
            print(f"{indent}├─ input: {v.data:.4f} (grad: {v.grad:.4f})")
        
        for child in v._prev:
            trace(child, depth + 1)
    
    print("Computation Graph (from output to inputs):")
    print("=" * 40)
    trace(root)

# Example
a = Value(2.0)
b = Value(3.0)
c = a * b
d = c + a
e = d ** 2

e.backward()
visualize_graph(e)

Practice Exercises

Problem: Extend the Value class with:
  • tanh() activation function
  • exp() function
  • Division operator
Hint: For tanh, ddxtanh(x)=1tanh2(x)\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)
Problem: Implement softmax for a vector and compute its gradient.The softmax function is: softmax(xi)=exi/jexj\text{softmax}(x_i) = e^{x_i} / \sum_j e^{x_j}The Jacobian is: si/xj=si(δijsj)\partial s_i / \partial x_j = s_i(\delta_{ij} - s_j)
Problem: Build a complete 2-layer neural network using only your Value class:
  1. Define network with random weights
  2. Forward pass on XOR data
  3. Compute MSE loss
  4. Backward pass
  5. Update weights with gradient descent

Summary

ConceptWhat It IsWhy It Matters
Computational GraphDAG of operationsEnables systematic gradient computation
Forward PassCompute function valueRecord operations for backward
Backward PassApply chain rule in reverseGet all gradients in one pass
Reverse ModeOutput → InputsEfficient for many params, one output
Gradient AccumulationGrads add upMust zero before each batch
Key Takeaway: Automatic differentiation is the engine of deep learning. It’s not magic - it’s just the chain rule applied systematically through a computational graph. Understanding autodiff helps you debug models, implement custom layers, and optimize training.

Interview Deep-Dive

Strong Answer:
  • Symbolic differentiation applies algebraic rules to produce an exact derivative formula. For f(x) = x^2 * sin(x), it gives f’(x) = 2xsin(x) + x^2cos(x). The problem is expression swell: for complex compositions, the symbolic derivative can be exponentially larger than the original expression. A 10-layer neural network’s loss function, symbolically differentiated, would produce an unmanageably large expression that is slow to evaluate even if you could derive it.
  • Numerical differentiation uses finite differences: (f(x+h) - f(x-h))/(2h). It is simple and works for any function you can evaluate. But it has two fatal flaws for deep learning: it requires O(n) function evaluations for n parameters (one per parameter), and it suffers from the truncation-cancellation trade-off where no choice of h gives both accuracy and stability simultaneously.
  • Automatic differentiation computes exact derivatives (to machine precision) at a cost proportional to the original function evaluation. It works by decomposing the computation into elementary operations and applying the chain rule through them. Reverse-mode autodiff (backpropagation) computes the gradient of a scalar output with respect to ALL inputs in a single backward pass, regardless of the number of inputs.
  • Deep learning chose autodiff because it is the only method that scales. With millions of parameters and a scalar loss, reverse-mode autodiff gives exact gradients in O(1) backward passes. Symbolic differentiation would produce an expression too large to store. Numerical differentiation would require millions of forward passes per gradient step. The cost ratio is not 2x or 10x — it is millions-to-one. This is literally what makes deep learning computationally feasible.
Follow-up: JAX offers both forward-mode and reverse-mode autodiff. Can you describe a scenario where you would compose the two modes?The classic use case is computing Hessian-vector products efficiently. The Hessian of a scalar function is the matrix of second derivatives. Computing the full Hessian is O(n^2) in space, which is prohibitive for large models. But a Hessian-vector product Hv can be computed in O(n) time and space by composing one forward-mode pass inside one reverse-mode pass. First, you use reverse-mode to compute the gradient g(theta). Then, you use forward-mode to compute the Jacobian-vector product of g with respect to theta in direction v, which gives Hv. In JAX, this is jax.jvp(jax.grad(loss), (params,), (v,)). This is used in conjugate gradient methods for second-order optimization, in the TRPO algorithm for reinforcement learning (which needs Hv for the Fisher-vector product), and in computing the spectral norm of the Hessian for loss landscape analysis.
Strong Answer:
  • In a static graph system (TensorFlow 1.x), you first define the entire computation as a graph object, then execute it in a separate “session.” The graph is compiled once and reused. This enables aggressive ahead-of-time optimizations: operation fusion, memory planning, dead code elimination, and cross-device placement. The downside is that Python control flow (if/else, loops) cannot be used naturally — you need special graph operations like tf.cond and tf.while_loop.
  • In a dynamic graph system (PyTorch, JAX in eager mode), the graph is built on-the-fly as Python code executes. Each line of Python immediately computes a value and appends a node to the graph. This means standard Python debugging tools (print statements, pdb, breakpoints) work naturally. Control flow is just regular Python.
  • The industry converged on dynamic graphs for a simple reason: researcher productivity. In ML research, you spend most of your time writing and debugging new model architectures. The ability to set a breakpoint, inspect intermediate tensors, and step through the computation in a standard debugger dramatically accelerates the research cycle. The performance overhead of dynamic graphs (typically 10-20% slower than optimized static graphs) is acceptable because researcher time is more expensive than GPU time.
  • The convergence is not absolute. TensorFlow 2.0 adopted eager execution by default but added tf.function for compiling hot paths into static graphs. PyTorch added torch.compile (PyTorch 2.0) which traces the dynamic graph and compiles it for performance. JAX takes a hybrid approach: eager by default, with jax.jit to compile functions. The modern consensus is: develop dynamically, deploy with compilation.
Follow-up: When you use torch.compile or jax.jit, what happens to Python control flow that depends on tensor values?This is where tracing-based compilation gets tricky. When you jit-compile a function, the compiler traces it with concrete input values and records the operations. If your code has a branch like “if x.sum() > 0: do_A() else: do_B()”, the compiler traces whichever branch is taken for the tracing inputs. If a future input takes the other branch, the compiled version is wrong. PyTorch’s torch.compile handles this with “graph breaks” — it detects data-dependent control flow and splits the compiled region, falling back to Python interpretation for the branch, then resuming compilation after. JAX’s jax.jit requires you to explicitly use jax.lax.cond for data-dependent branches, making the control flow visible to the compiler. In practice, most neural network forward passes have minimal data-dependent control flow, so compilation works well for the bulk of the computation.
Strong Answer:
  • You subclass torch.autograd.Function and implement two static methods: forward() and backward(). forward() receives input tensors and a context object, computes the output, and saves any tensors needed for the backward pass using ctx.save_for_backward(). backward() receives the upstream gradient (grad_output) and the context, and must return one gradient per forward input.
  • The contract is: backward must return tensors with the same shape as the corresponding forward inputs. If an input does not need a gradient (like an integer parameter), return None for that position. Getting the number and order of returned gradients wrong is one of the most common bugs.
  • Pitfall one: saving too much or too little in the context. If you forget to save an intermediate tensor and try to use a forward-pass local variable in backward, it will be garbage-collected and you get a crash or wrong result. If you save too many large tensors, you waste memory.
  • Pitfall two: in-place modification of saved tensors. If you save tensor A in the forward pass and then modify A in-place before backward runs, the saved reference points to the modified data, not the original. Always save copies if there is any chance of in-place modification.
  • Pitfall three: not handling batched inputs. Your backward must work for arbitrary batch sizes. A common bug is implementing the gradient for a single sample and then getting shape errors with batches.
  • Always validate with torch.autograd.gradcheck, which compares your analytical backward against numerical finite differences. Run this with float64 tensors for maximum precision.
Follow-up: When would you need a custom backward pass instead of relying on PyTorch’s built-in autograd?Three main scenarios. First, when the autodiff-generated backward is numerically unstable but a manually-derived formula is stable. The log-determinant of a matrix is a classic example. Second, when the forward pass involves non-differentiable operations that you want to approximate with a straight-through estimator — for example, quantization (rounding to discrete values) has zero derivative everywhere, but you can define a custom backward that passes the gradient through unchanged. This is essential for training quantized neural networks. Third, memory optimization: sometimes the autodiff graph stores large intermediate tensors that you can avoid by implementing a mathematically equivalent backward that recomputes them from smaller saved quantities.
Strong Answer:
  • In PyTorch, calling loss.backward() ADDS the computed gradients to the .grad attribute of each parameter rather than replacing them. If you forget to call optimizer.zero_grad() before computing the next batch’s gradients, the gradients from the previous batch are still there, and the new batch’s gradients are added on top. The result is that your effective gradient is the sum of all batches since the last zero_grad(), which is mathematically wrong for standard SGD.
  • This is the single most common PyTorch bug for beginners. The symptom is that the model appears to train but converges to a worse solution or oscillates erratically. It is insidious because the code does not crash — it just silently computes wrong updates.
  • Why PyTorch does not zero automatically: gradient accumulation is a deliberate feature, not a bug. It enables training with effective batch sizes larger than what fits in GPU memory. If your GPU can hold batch size 8 but you want the gradient quality of batch size 32, you run 4 forward-backward passes (each with batch 8) without zeroing gradients, then call optimizer.step(). The accumulated gradient is mathematically equivalent to computing the gradient over all 32 samples at once.
  • The accumulation semantics also enable multi-task and multi-loss training. If you have two losses (classification loss + reconstruction loss), you can call loss1.backward() and loss2.backward() separately, and the gradients accumulate correctly.
  • Best practice: always call optimizer.zero_grad() at the start of each training iteration (before loss.backward()), not after optimizer.step().
Follow-up: If you are using gradient accumulation to simulate a larger batch size, do you need to scale the loss or gradients?Yes, you need to scale. If you accumulate over K micro-batches, each backward pass adds gradients as if the loss were computed over one micro-batch. The accumulated gradient is K times larger than the gradient from a single forward-backward pass on the full batch (because PyTorch averages over samples within a batch but sums across backward calls). You should divide the loss by K before calling backward (so each backward contributes 1/K of the gradient), or divide the gradient by K after accumulation before optimizer.step(). If you forget to scale, your effective learning rate is K times larger than intended, which can cause instability. The loss division approach is cleaner and more common.