Automatic Differentiation

The Problem with Manual Gradients

In the previous modules, you computed gradients by hand. For simple functions like $f(x) = x^2$, that's easy. But what about this?

$$f(x) = \sigma\left(\sum_i w_i \cdot \text{ReLU}\left(\sum_j v_{ij} \cdot x_j + b_i\right)\right)$$

That's a 2-layer neural network. Now imagine 100 layers. Nobody computes these gradients by hand. Automatic differentiation (autodiff) does it for you - and that's what powers PyTorch and TensorFlow.
Estimated Time: 3-4 hours
Difficulty: Intermediate
Prerequisites: Chain Rule module
What You’ll Build: Your own mini autodiff system!

Three Ways to Compute Gradients

| Method | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| Symbolic | Algebraic manipulation | Exact formulas | Explodes for complex functions |
| Numerical | Finite differences: $\frac{f(x+h) - f(x)}{h}$ | Simple | Slow, imprecise |
| Automatic | Tracks operations, applies chain rule | Fast, exact | Needs special implementation |
import numpy as np

# Example function: f(x) = sin(x^2)
def f(x):
    return np.sin(x**2)

# Method 1: Symbolic (by hand)
def df_symbolic(x):
    # f'(x) = cos(x^2) * 2x
    return np.cos(x**2) * 2 * x

# Method 2: Numerical differentiation
def df_numerical(x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

# Compare at x = 1.5
x = 1.5
print(f"Symbolic derivative: {df_symbolic(x):.8f}")
print(f"Numerical derivative: {df_numerical(x):.8f}")
print(f"Difference: {abs(df_symbolic(x) - df_numerical(x)):.2e}")
The bigger problem with numerical differentiation is cost:
# For a function with 1 million parameters
n_params = 1_000_000

# Numerical: requires 2 million function evaluations!
numerical_cost = 2 * n_params

# Autodiff: just one forward + one backward pass
autodiff_cost = 2

print(f"Numerical: {numerical_cost:,} evaluations")
print(f"Autodiff: {autodiff_cost} passes")
print(f"Speedup: {numerical_cost / autodiff_cost:,}x")

The Core Idea: Computational Graphs

Every computation can be broken into a graph of simple operations.

Example: $f(x_1, x_2) = (x_1 + x_2) \cdot (x_1 \cdot x_2)$

# Forward pass: compute f(2, 3)
x1, x2 = 2, 3

# Step-by-step with intermediate variables
a = x1 + x2      # a = 5
b = x1 * x2      # b = 6
f = a * b        # f = 30

print(f"f({x1}, {x2}) = {f}")

# The computation graph:
#
#   x1 ─┬─→ [+] ──→ a ──┐
#       │               │
#       └─→ [*] ──→ b ──┼─→ [*] ──→ f
#       │               │
#   x2 ─┴───────────────┘

Backward Pass: Chain Rule Through the Graph

# To find ∂f/∂x1, trace paths from f back to x1

# Path 1: f → a → x1
# ∂f/∂a = b = 6
# ∂a/∂x1 = 1
# Contribution: 6 * 1 = 6

# Path 2: f → b → x1
# ∂f/∂b = a = 5
# ∂b/∂x1 = x2 = 3
# Contribution: 5 * 3 = 15

# Total: ∂f/∂x1 = 6 + 15 = 21

# Similarly for x2:
# Path 1: f → a → x2: ∂f/∂a * ∂a/∂x2 = 6 * 1 = 6
# Path 2: f → b → x2: ∂f/∂b * ∂b/∂x2 = 5 * 2 = 10   (since ∂b/∂x2 = x1 = 2)
# Total: ∂f/∂x2 = 6 + 10 = 16

# Verify numerically
def f_func(x1, x2):
    return (x1 + x2) * (x1 * x2)

h = 1e-7
df_dx1 = (f_func(2 + h, 3) - f_func(2 - h, 3)) / (2 * h)
df_dx2 = (f_func(2, 3 + h) - f_func(2, 3 - h)) / (2 * h)

print(f"∂f/∂x1 = {df_dx1:.6f} (should be 21)")
print(f"∂f/∂x2 = {df_dx2:.6f} (should be 16)")

Build Your Own Autodiff System

Let’s implement a simple autodiff system from scratch!
class Value:
    """
    A differentiable value that tracks its gradient.
    Inspired by Andrej Karpathy's micrograd.
    """
    
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0  # Gradient of final output w.r.t. this value
        self._backward = lambda: None  # Function to compute local gradients
        self._prev = set(_children)
        self._op = _op
    
    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"
    
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        
        def _backward():
            # d(a+b)/da = 1, d(a+b)/db = 1
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        
        out._backward = _backward
        return out
    
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        
        def _backward():
            # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        
        out._backward = _backward
        return out
    
    def __pow__(self, n):
        out = Value(self.data ** n, (self,), f'**{n}')
        
        def _backward():
            # d(x^n)/dx = n * x^(n-1)
            self.grad += n * (self.data ** (n-1)) * out.grad
        
        out._backward = _backward
        return out
    
    def relu(self):
        out = Value(max(0, self.data), (self,), 'ReLU')
        
        def _backward():
            # d(relu)/dx = 1 if x > 0 else 0
            self.grad += (self.data > 0) * out.grad
        
        out._backward = _backward
        return out
    
    def backward(self):
        """Compute gradients using reverse-mode autodiff."""
        # Topological sort
        topo = []
        visited = set()
        
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        
        build_topo(self)
        
        # Set output gradient to 1
        self.grad = 1.0
        
        # Backpropagate through the graph
        for v in reversed(topo):
            v._backward()
    
    # Enable reverse operations (e.g., 2 * x)
    def __radd__(self, other):
        return self + other
    
    def __rmul__(self, other):
        return self * other
    
    def __neg__(self):
        return self * -1
    
    def __sub__(self, other):
        return self + (-other)

# Test our autodiff!
print("Testing our autodiff system:")
print("=" * 40)

# Example 1: Simple expression
x = Value(2.0)
y = Value(3.0)
z = x * y + x ** 2

print(f"z = x*y + x^2 at x=2, y=3")
print(f"z = {z.data}")

z.backward()
print(f"∂z/∂x = {x.grad} (should be y + 2x = 7)")
print(f"∂z/∂y = {y.grad} (should be x = 2)")

# Example 2: Neural network layer!
print("\n" + "=" * 40)
print("Neural network forward/backward pass:")

# Input
x1 = Value(1.0)
x2 = Value(2.0)

# Weights
w1 = Value(0.5)
w2 = Value(-0.5)
b = Value(0.1)

# Forward pass: neuron = relu(w1*x1 + w2*x2 + b)
neuron = (w1 * x1 + w2 * x2 + b).relu()

print(f"Input: x1={x1.data}, x2={x2.data}")
print(f"Weights: w1={w1.data}, w2={w2.data}, b={b.data}")
print(f"Output: {neuron.data}")

# Backward pass
neuron.backward()

print(f"\nGradients:")
print(f"  ∂neuron/∂w1 = {w1.grad}")
print(f"  ∂neuron/∂w2 = {w2.grad}")
print(f"  ∂neuron/∂b = {b.grad}")

Forward Mode vs Reverse Mode

There are two ways to do autodiff:
| Mode | Computes | Best For |
| --- | --- | --- |
| Forward | One input's effect on all outputs | Few inputs, many outputs |
| Reverse | All inputs' effect on one output | Many inputs, one output |
Neural networks have millions of inputs (weights) and one output (loss) → Reverse mode wins!
# Forward mode: compute Jacobian column by column
# For f: R^n → R^m, need n passes to get full Jacobian

# Reverse mode: compute Jacobian row by row  
# For f: R^n → R^m, need m passes to get full Jacobian

# Neural network:
# - n = millions of weights
# - m = 1 (scalar loss)

print("Forward mode for neural network:")
print(f"  Passes needed: n = millions")

print("\nReverse mode for neural network:")
print(f"  Passes needed: m = 1")
print(f"  That's why backprop works!")

PyTorch: Autodiff in Action

import torch

# PyTorch uses reverse-mode autodiff
x = torch.tensor([2.0], requires_grad=True)
y = torch.tensor([3.0], requires_grad=True)

# Build computation graph
z = x * y + x ** 2

print(f"z = {z.item()}")

# Backward pass (computes gradients)
z.backward()

print(f"∂z/∂x = {x.grad.item()} (should be 7)")
print(f"∂z/∂y = {y.grad.item()} (should be 2)")

# Neural network example
print("\n" + "=" * 40)
print("PyTorch Neural Network Gradients:")

# Simple 2-layer network
torch.manual_seed(42)

# Input
X = torch.randn(1, 10)

# Layers (with gradients tracked)
W1 = torch.randn(10, 5, requires_grad=True)
W2 = torch.randn(5, 1, requires_grad=True)

# Forward pass
hidden = torch.relu(X @ W1)
output = hidden @ W2
loss = (output - 1.0) ** 2  # Squared error

print(f"Loss: {loss.item():.4f}")

# Backward pass (computes all gradients!)
loss.backward()

print(f"W1 gradient shape: {W1.grad.shape}")
print(f"W2 gradient shape: {W2.grad.shape}")
print(f"W1 gradient norm: {W1.grad.norm().item():.4f}")
print(f"W2 gradient norm: {W2.grad.norm().item():.4f}")

Common Pitfalls

1. Gradient Accumulation

# Gradients accumulate! Need to zero them between batches.
x = torch.tensor([2.0], requires_grad=True)

# First backward
y = x ** 2
y.backward()
print(f"After first backward: grad = {x.grad.item()}")

# Second backward WITHOUT zeroing
y = x ** 2
y.backward()
print(f"After second backward (accumulated!): grad = {x.grad.item()}")

# Correct way
x.grad.zero_()
y = x ** 2
y.backward()
print(f"After zeroing and backward: grad = {x.grad.item()}")

2. Detaching from Graph

# Sometimes you don't want gradients to flow through
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2

# Detach y from the graph
z = y.detach() * x  # Gradient won't flow through y

z.backward()
print(f"x.grad = {x.grad.item()}")  # Only from the direct x, not x^2

3. In-place Operations

# In-place operations can break autodiff
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2

# Modifying y in place can raise a RuntimeError if autograd needs the
# original value of y for the backward pass:
# y.add_(1)  # In-place addition

# Do this instead:
y = y + 1  # Creates a new tensor, leaving the graph intact

Visualizing Computation Graphs

def visualize_graph(root):
    """Print a simple text visualization of the computation graph."""
    
    def trace(v, depth=0):
        indent = "  " * depth
        if v._op:
            print(f"{indent}├─ [{v._op}] = {v.data:.4f} (grad: {v.grad:.4f})")
        else:
            print(f"{indent}├─ input: {v.data:.4f} (grad: {v.grad:.4f})")
        
        for child in v._prev:
            trace(child, depth + 1)
    
    print("Computation Graph (from output to inputs):")
    print("=" * 40)
    trace(root)

# Example
a = Value(2.0)
b = Value(3.0)
c = a * b
d = c + a
e = d ** 2

e.backward()
visualize_graph(e)

Practice Exercises

Problem: Extend the Value class with:
  • tanh() activation function
  • exp() function
  • Division operator
Hint: For tanh, $\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)$
Problem: Implement softmax for a vector and compute its gradient. The softmax function is $\text{softmax}(x_i) = e^{x_i} / \sum_j e^{x_j}$, and its Jacobian is $\partial s_i / \partial x_j = s_i(\delta_{ij} - s_j)$.
Problem: Build a complete 2-layer neural network using only your Value class:
  1. Define network with random weights
  2. Forward pass on XOR data
  3. Compute MSE loss
  4. Backward pass
  5. Update weights with gradient descent

Summary

| Concept | What It Is | Why It Matters |
| --- | --- | --- |
| Computational Graph | DAG of operations | Enables systematic gradient computation |
| Forward Pass | Compute function value | Record operations for backward |
| Backward Pass | Apply chain rule in reverse | Get all gradients in one pass |
| Reverse Mode | Output → Inputs | Efficient for many params, one output |
| Gradient Accumulation | Grads add up | Must zero before each batch |
Key Takeaway: Automatic differentiation is the engine of deep learning. It’s not magic - it’s just the chain rule applied systematically through a computational graph. Understanding autodiff helps you debug models, implement custom layers, and optimize training.