Automatic Differentiation
The Problem with Manual Gradients
Three Ways to Compute Gradients
The Core Idea: Computational Graphs
Example:
Backward Pass: Chain Rule Through the Graph
Build Your Own Autodiff System
Forward Mode vs Reverse Mode
PyTorch: Autodiff in Action
Common Pitfalls
1. Gradient Accumulation
2. Detaching from Graph
3. In-place Operations
Visualizing Computation Graphs
Practice Exercises
Summary

$Automatic Differentiation$

Automatic Differentiation

The Problem with Manual Gradients

In the previous modules, you computed gradients by hand. For simple functions like

f(x) = x^2

, that’s easy. But what about this?

f(x) = \sigma\left(\sum_i w_i \cdot \text{ReLU}\left(\sum_j v_{ij} \cdot x_j + b_i\right)\right)

That’s a 2-layer neural network. Now imagine 100 layers. Nobody computes these gradients by hand. Automatic differentiation (autodiff) does it for you - and that’s what powers PyTorch and TensorFlow.

Estimated Time: 3-4 hours
Difficulty: Intermediate
Prerequisites: Chain Rule module
What You’ll Build: Your own mini autodiff system!

Three Ways to Compute Gradients

Method	How It Works	Pros	Cons
Symbolic	Algebraic manipulation	Exact formulas	Explodes for complex functions
Numerical	Finite differences: $\frac{f(x+h) - f(x)}{h}$	Simple	Slow, imprecise
Automatic	Tracks operations, applies chain rule	Fast, exact	Needs special implementation

import numpy as np

# Example function: f(x) = sin(x^2)
def f(x):
    return np.sin(x**2)

# Method 1: Symbolic (by hand)
def df_symbolic(x):
    # f'(x) = cos(x^2) * 2x
    return np.cos(x**2) * 2 * x

# Method 2: Numerical differentiation
def df_numerical(x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

# Compare at x = 1.5
x = 1.5
print(f"Symbolic derivative: {df_symbolic(x):.8f}")
print(f"Numerical derivative: {df_numerical(x):.8f}")
print(f"Difference: {abs(df_symbolic(x) - df_numerical(x)):.2e}")

The problem with numerical differentiation:

# For a function with 1 million parameters
n_params = 1_000_000

# Numerical: requires 2 million function evaluations!
numerical_cost = 2 * n_params

# Autodiff: just one forward + one backward pass
autodiff_cost = 2

print(f"Numerical: {numerical_cost:,} evaluations")
print(f"Autodiff: {autodiff_cost} passes")
print(f"Speedup: {numerical_cost / autodiff_cost:,}x")

The Core Idea: Computational Graphs

Every computation can be broken into a graph of simple operations.

Example: $f(x_1, x_2) = (x_1 + x_2) \cdot (x_1 \cdot x_2)$

# Forward pass: compute f(2, 3)
x1, x2 = 2, 3

# Step-by-step with intermediate variables
a = x1 + x2      # a = 5
b = x1 * x2      # b = 6
f = a * b        # f = 30

print(f"f({x1}, {x2}) = {f}")

# The computation graph:
#
#   x1 ─┬─→ [+] ──→ a ──┐
#       │               │
#       └─→ [*] ──→ b ──┼─→ [*] ──→ f
#       │               │
#   x2 ─┴───────────────┘

Backward Pass: Chain Rule Through the Graph

# To find ∂f/∂x1, trace paths from f back to x1

# Path 1: f → a → x1
# ∂f/∂a = b = 6
# ∂a/∂x1 = 1
# Contribution: 6 * 1 = 6

# Path 2: f → b → x1
# ∂f/∂b = a = 5
# ∂b/∂x1 = x2 = 3
# Contribution: 5 * 3 = 15

# Total: ∂f/∂x1 = 6 + 15 = 21

# Similarly for x2:
# Path 1: f → a → x2: 6 * 1 = 6
# Path 2: f → b → x2: 5 * 2 = 10
# Total: ∂f/∂x2 = 16

# Verify numerically
def f_func(x1, x2):
    return (x1 + x2) * (x1 * x2)

h = 1e-7
df_dx1 = (f_func(2 + h, 3) - f_func(2 - h, 3)) / (2 * h)
df_dx2 = (f_func(2, 3 + h) - f_func(2, 3 - h)) / (2 * h)

print(f"∂f/∂x1 = {df_dx1:.6f} (should be 21)")
print(f"∂f/∂x2 = {df_dx2:.6f} (should be 16)")

Build Your Own Autodiff System

Let’s implement a simple autodiff system from scratch!

class Value:
    """
    A differentiable value that tracks its gradient.
    Inspired by Andrej Karpathy's micrograd.
    """
    
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0  # Gradient of final output w.r.t. this value
        self._backward = lambda: None  # Function to compute local gradients
        self._prev = set(_children)
        self._op = _op
    
    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"
    
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        
        def _backward():
            # d(a+b)/da = 1, d(a+b)/db = 1
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        
        out._backward = _backward
        return out
    
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        
        def _backward():
            # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        
        out._backward = _backward
        return out
    
    def __pow__(self, n):
        out = Value(self.data ** n, (self,), f'**{n}')
        
        def _backward():
            # d(x^n)/dx = n * x^(n-1)
            self.grad += n * (self.data ** (n-1)) * out.grad
        
        out._backward = _backward
        return out
    
    def relu(self):
        out = Value(max(0, self.data), (self,), 'ReLU')
        
        def _backward():
            # d(relu)/dx = 1 if x > 0 else 0
            self.grad += (self.data > 0) * out.grad
        
        out._backward = _backward
        return out
    
    def backward(self):
        """Compute gradients using reverse-mode autodiff."""
        # Topological sort
        topo = []
        visited = set()
        
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        
        build_topo(self)
        
        # Set output gradient to 1
        self.grad = 1.0
        
        # Backpropagate through the graph
        for v in reversed(topo):
            v._backward()
    
    # Enable reverse operations (e.g., 2 * x)
    def __radd__(self, other):
        return self + other
    
    def __rmul__(self, other):
        return self * other
    
    def __neg__(self):
        return self * -1
    
    def __sub__(self, other):
        return self + (-other)

# Test our autodiff!
print("Testing our autodiff system:")
print("=" * 40)

# Example 1: Simple expression
x = Value(2.0)
y = Value(3.0)
z = x * y + x ** 2

print(f"z = x*y + x^2 at x=2, y=3")
print(f"z = {z.data}")

z.backward()
print(f"∂z/∂x = {x.grad} (should be y + 2x = 7)")
print(f"∂z/∂y = {y.grad} (should be x = 2)")

# Example 2: Neural network layer!
print("\n" + "=" * 40)
print("Neural network forward/backward pass:")

# Input
x1 = Value(1.0)
x2 = Value(2.0)

# Weights
w1 = Value(0.5)
w2 = Value(-0.5)
b = Value(0.1)

# Forward pass: neuron = relu(w1*x1 + w2*x2 + b)
neuron = (w1 * x1 + w2 * x2 + b).relu()

print(f"Input: x1={x1.data}, x2={x2.data}")
print(f"Weights: w1={w1.data}, w2={w2.data}, b={b.data}")
print(f"Output: {neuron.data}")

# Backward pass
neuron.backward()

print(f"\nGradients:")
print(f"  ∂neuron/∂w1 = {w1.grad}")
print(f"  ∂neuron/∂w2 = {w2.grad}")
print(f"  ∂neuron/∂b = {b.grad}")

Forward Mode vs Reverse Mode

There are two ways to do autodiff:

Mode	Computes	Best For
Forward	One input’s effect on all outputs	Few inputs, many outputs
Reverse	All inputs’ effect on one output	Many inputs, one output

Neural networks have millions of inputs (weights) and one output (loss) → Reverse mode wins!

# Forward mode: compute Jacobian column by column
# For f: R^n → R^m, need n passes to get full Jacobian

# Reverse mode: compute Jacobian row by row  
# For f: R^n → R^m, need m passes to get full Jacobian

# Neural network:
# - n = millions of weights
# - m = 1 (scalar loss)

print("Forward mode for neural network:")
print(f"  Passes needed: n = millions")

print("\nReverse mode for neural network:")
print(f"  Passes needed: m = 1")
print(f"  That's why backprop works!")

PyTorch: Autodiff in Action

import torch

# PyTorch uses reverse-mode autodiff
x = torch.tensor([2.0], requires_grad=True)
y = torch.tensor([3.0], requires_grad=True)

# Build computation graph
z = x * y + x ** 2

print(f"z = {z.item()}")

# Backward pass (computes gradients)
z.backward()

print(f"∂z/∂x = {x.grad.item()} (should be 7)")
print(f"∂z/∂y = {y.grad.item()} (should be 2)")

# Neural network example
print("\n" + "=" * 40)
print("PyTorch Neural Network Gradients:")

# Simple 2-layer network
torch.manual_seed(42)

# Input
X = torch.randn(1, 10)

# Layers (with gradients tracked)
W1 = torch.randn(10, 5, requires_grad=True)
W2 = torch.randn(5, 1, requires_grad=True)

# Forward pass
hidden = torch.relu(X @ W1)
output = hidden @ W2
loss = (output - 1.0) ** 2  # Squared error

print(f"Loss: {loss.item():.4f}")

# Backward pass (computes all gradients!)
loss.backward()

print(f"W1 gradient shape: {W1.grad.shape}")
print(f"W2 gradient shape: {W2.grad.shape}")
print(f"W1 gradient norm: {W1.grad.norm().item():.4f}")
print(f"W2 gradient norm: {W2.grad.norm().item():.4f}")

Common Pitfalls

1. Gradient Accumulation

# Gradients accumulate! Need to zero them between batches.
x = torch.tensor([2.0], requires_grad=True)

# First backward
y = x ** 2
y.backward()
print(f"After first backward: grad = {x.grad.item()}")

# Second backward WITHOUT zeroing
y = x ** 2
y.backward()
print(f"After second backward (accumulated!): grad = {x.grad.item()}")

# Correct way
x.grad.zero_()
y = x ** 2
y.backward()
print(f"After zeroing and backward: grad = {x.grad.item()}")

2. Detaching from Graph

# Sometimes you don't want gradients to flow through
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2

# Detach y from the graph
z = y.detach() * x  # Gradient won't flow through y

z.backward()
print(f"x.grad = {x.grad.item()}")  # Only from the direct x, not x^2

3. In-place Operations

# In-place operations can break autodiff
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2

# This WILL cause problems:
# y.add_(1)  # In-place addition

# Do this instead:
y = y + 1  # Creates new tensor

Visualizing Computation Graphs

def visualize_graph(root, filename='graph'):
    """Create a simple text visualization of the computation graph."""
    
    def trace(v, depth=0):
        indent = "  " * depth
        if v._op:
            print(f"{indent}├─ {v._op} → {v.data:.4f} (grad: {v.grad:.4f})")
        else:
            print(f"{indent}├─ input: {v.data:.4f} (grad: {v.grad:.4f})")
        
        for child in v._prev:
            trace(child, depth + 1)
    
    print("Computation Graph (from output to inputs):")
    print("=" * 40)
    trace(root)

# Example
a = Value(2.0)
b = Value(3.0)
c = a * b
d = c + a
e = d ** 2

e.backward()
visualize_graph(e)

Practice Exercises

Exercise 1: Add More Operations

Problem: Extend the Value class with:

tanh() activation function
exp() function
Division operator

Hint: For tanh,

\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)

Exercise 2: Softmax Gradient

Problem: Implement softmax for a vector and compute its gradient.The softmax function is:

\text{softmax}(x_i) = e^{x_i} / \sum_j e^{x_j}

The Jacobian is:

\partial s_i / \partial x_j = s_i(\delta_{ij} - s_j)

Exercise 3: Implement from Scratch

Problem: Build a complete 2-layer neural network using only your Value class:

Define network with random weights
Forward pass on XOR data
Compute MSE loss
Backward pass
Update weights with gradient descent

Summary

Concept	What It Is	Why It Matters
Computational Graph	DAG of operations	Enables systematic gradient computation
Forward Pass	Compute function value	Record operations for backward
Backward Pass	Apply chain rule in reverse	Get all gradients in one pass
Reverse Mode	Output → Inputs	Efficient for many params, one output
Gradient Accumulation	Grads add up	Must zero before each batch

Key Takeaway: Automatic differentiation is the engine of deep learning. It’s not magic - it’s just the chain rule applied systematically through a computational graph. Understanding autodiff helps you debug models, implement custom layers, and optimize training.

Final Project Convexity

Overview

Testing & Code Quality

Crash Courses

AI Engineering

Math for ML - Understanding Linear Algebra

Probability & Statistics for ML

Math for ML - Understanding Calculus

ML Mastery

Deep Learning Mastery

NestJS Mastery

Microservices Mastery

Low Level Design

OOP Concepts

SOLID Principles

Design Patterns

LLD Case Studies

System Design (HLD)

Senior Level (L5+/Staff)

HLD Case Studies

Engineering Fundamentals

DevOps & Operations

Azure Cloud Engineering

AWS Cloud

AWS Monitoring & Observability

AWS Security Services

AWS Serverless

AWS Operations

AWS Advanced

AWS Case Studies

GCP Cloud Engineering

DevOps Tools

Database Engineering

HIPAA Compliance Mastery

Operating Systems

Linux Internals

Distributed Systems

Networking Mastery

Build Your Own X

Go Lang Mastery

C Programming

Classic Research Papers

Distributed System Tools

​Automatic Differentiation

​The Problem with Manual Gradients

​Three Ways to Compute Gradients

​The Core Idea: Computational Graphs

​Example: f(x1,x2)=(x1+x2)⋅(x1⋅x2)f(x_1, x_2) = (x_1 + x_2) \cdot (x_1 \cdot x_2)f(x1​,x2​)=(x1​+x2​)⋅(x1​⋅x2​)

​Backward Pass: Chain Rule Through the Graph

​Build Your Own Autodiff System

​Forward Mode vs Reverse Mode

​PyTorch: Autodiff in Action

​Common Pitfalls

​1. Gradient Accumulation

​2. Detaching from Graph

​3. In-place Operations

​Visualizing Computation Graphs

​Practice Exercises

​Summary

Automatic Differentiation

The Problem with Manual Gradients

Three Ways to Compute Gradients

The Core Idea: Computational Graphs

Example: $f(x_1, x_2) = (x_1 + x_2) \cdot (x_1 \cdot x_2)$

Backward Pass: Chain Rule Through the Graph

Build Your Own Autodiff System

Forward Mode vs Reverse Mode

PyTorch: Autodiff in Action

Common Pitfalls

1. Gradient Accumulation

2. Detaching from Graph

3. In-place Operations

Visualizing Computation Graphs

Practice Exercises

Summary