In the previous modules, you computed gradients by hand. For simple functions like $f(x) = x^2$, that’s easy. But what about this?

$$f(x) = \sigma\left(\sum_i w_i \cdot \mathrm{ReLU}\left(\sum_j v_{ij}\, x_j + b_i\right)\right)$$

That’s a 2-layer neural network. Now imagine 100 layers. Nobody computes these gradients by hand. Automatic differentiation (autodiff) does it for you - and that’s what powers PyTorch and TensorFlow.
Estimated Time: 3-4 hours
Difficulty: Intermediate
Prerequisites: Chain Rule module
What You’ll Build: Your own mini autodiff system!
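As a preview of where this module ends up, here is a minimal sketch (the layer sizes and random weights are made up for illustration) that builds the two-layer network from the formula above in PyTorch and lets autograd produce every gradient in a single backward pass:

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes: 3 inputs, 4 hidden units
x = torch.randn(3)
V = torch.randn(4, 3, requires_grad=True)   # v_ij
b = torch.randn(4, requires_grad=True)      # b_i
w = torch.randn(4, requires_grad=True)      # w_i

# f(x) = sigma( sum_i w_i * ReLU( sum_j v_ij * x_j + b_i ) )
hidden = torch.relu(V @ x + b)
f = torch.sigmoid(w @ hidden)

f.backward()  # one backward pass fills in every .grad
print(V.grad.shape, b.grad.shape, w.grad.shape)
```

The rest of this module is about understanding what that one `backward()` call is doing.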
```python
# For a function with 1 million parameters
n_params = 1_000_000

# Numerical: requires 2 million function evaluations!
numerical_cost = 2 * n_params

# Autodiff: just one forward + one backward pass
autodiff_cost = 2

print(f"Numerical: {numerical_cost:,} evaluations")
print(f"Autodiff: {autodiff_cost} passes")
print(f"Speedup: {numerical_cost / autodiff_cost:,}x")
```
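To convince yourself that the two approaches also agree numerically, here is a small sanity check on a toy three-parameter function (the function is an assumption for illustration), comparing a central-difference estimate with autograd's gradient:

```python
import torch

def f(x):
    return (x ** 3 + 2 * x).sum()

x = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)

# Autodiff: one forward + one backward pass gives the whole gradient
f(x).backward()
print("autodiff: ", x.grad)

# Numerical: two function evaluations *per parameter*
eps = 1e-4
numerical = torch.zeros_like(x)
with torch.no_grad():
    for i in range(x.numel()):
        e = torch.zeros_like(x)
        e[i] = eps
        numerical[i] = (f(x + e) - f(x - e)) / (2 * eps)
print("numerical:", numerical)
```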
Neural networks have millions of inputs (weights) and one output (loss) → Reverse mode wins!
```python
# Forward mode: compute Jacobian column by column
# For f: R^n → R^m, need n passes to get full Jacobian

# Reverse mode: compute Jacobian row by row
# For f: R^n → R^m, need m passes to get full Jacobian

# Neural network:
# - n = millions of weights
# - m = 1 (scalar loss)

print("Forward mode for neural network:")
print("  Passes needed: n = millions")
print("\nReverse mode for neural network:")
print("  Passes needed: m = 1")
print("  That's why backprop works!")
```
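To see the two modes side by side, `torch.autograd.functional` exposes `jvp` (Jacobian-vector product, the forward-mode primitive) and `vjp` (vector-Jacobian product, the reverse-mode primitive). The toy function below is just for illustration:

```python
import torch
from torch.autograd.functional import jvp, vjp

# Toy f: R^3 -> R^2
def f(x):
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

x = torch.tensor([1.0, 2.0, 3.0])

# Forward-mode product: one pass gives J @ v for a chosen input direction v
# (one *column* of the Jacobian when v is a standard basis vector).
# PyTorch computes jvp via a double-backward trick, but the result is the same.
_, jv = jvp(f, x, torch.tensor([1.0, 0.0, 0.0]))
print("J @ e1   =", jv)

# Reverse-mode product: one pass gives u^T @ J for a chosen output direction u
# (one *row* of the Jacobian when u is a standard basis vector).
_, uj = vjp(f, x, torch.tensor([1.0, 0.0]))
print("e1^T @ J =", uj)
```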
```python
import torch

# Gradients accumulate! Need to zero them between batches.
x = torch.tensor([2.0], requires_grad=True)

# First backward
y = x ** 2
y.backward()
print(f"After first backward: grad = {x.grad.item()}")

# Second backward WITHOUT zeroing
y = x ** 2
y.backward()
print(f"After second backward (accumulated!): grad = {x.grad.item()}")

# Correct way
x.grad.zero_()
y = x ** 2
y.backward()
print(f"After zeroing and backward: grad = {x.grad.item()}")
```
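In practice you rarely call `.grad.zero_()` by hand; the usual pattern is `optimizer.zero_grad()` at the top of each training step. A minimal sketch with made-up data:

```python
import torch

model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

# Made-up data, just to show the pattern
inputs = torch.randn(8, 3)
targets = torch.randn(8, 1)

for step in range(3):
    optimizer.zero_grad()                      # clear gradients from the previous step
    loss = loss_fn(model(inputs), targets)
    loss.backward()                            # accumulate fresh gradients
    optimizer.step()                           # update the weights
    print(f"step {step}: loss = {loss.item():.4f}")
```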
```python
# Sometimes you don't want gradients to flow through
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2

# Detach y from the graph
z = y.detach() * x  # Gradient won't flow through y

z.backward()
print(f"x.grad = {x.grad.item()}")  # Only from the direct x factor, not x^2
```
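A related tool is `torch.no_grad()`, which skips graph construction for a whole block rather than detaching a single tensor; a quick sketch:

```python
import torch

x = torch.tensor([2.0], requires_grad=True)

# Inside no_grad, no computational graph is recorded at all
with torch.no_grad():
    y = x ** 2
print(y.requires_grad)  # False: y has no grad_fn, so nothing to backprop

# Outside, the graph is recorded as usual
y = x ** 2
print(y.requires_grad)  # True
```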
```python
# In-place operations can break autodiff
x = torch.tensor([2.0], requires_grad=True)
y = x.exp()  # exp's backward pass needs its saved *output* y

# This WILL cause problems at backward time (the saved output gets overwritten):
# y.add_(1)  # In-place addition

# Do this instead:
y = y + 1  # Creates a new tensor and leaves the saved output intact
```
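If you are curious what the failure actually looks like, the sketch below triggers it deliberately (the exact error message varies across PyTorch versions):

```python
import torch

x = torch.tensor([2.0], requires_grad=True)
y = x.exp()   # exp's backward pass reuses the output y
y.add_(1)     # modify the saved output in place

try:
    y.backward()
except RuntimeError as e:
    print("Autograd error:", e)
```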
Problem: Implement softmax for a vector and compute its gradient.

The softmax function is:

$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

The Jacobian is:

$$\frac{\partial s_i}{\partial x_j} = s_i(\delta_{ij} - s_j)$$
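If you want to check your solution, one possible reference sketch (not the official solution) compares a hand-written Jacobian against the one `torch.autograd.functional.jacobian` produces:

```python
import torch
from torch.autograd.functional import jacobian

def softmax(x):
    # Subtract the max for numerical stability (doesn't change the result)
    e = torch.exp(x - x.max())
    return e / e.sum()

def softmax_jacobian(x):
    # ds_i/dx_j = s_i * (delta_ij - s_j)  ->  diag(s) - s s^T
    s = softmax(x)
    return torch.diag(s) - torch.outer(s, s)

x = torch.tensor([1.0, 2.0, 3.0])
J_auto = jacobian(softmax, x)
print(torch.allclose(softmax_jacobian(x), J_auto, atol=1e-6))  # True
```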
Exercise 3: Implement from Scratch
Problem: Build a complete 2-layer neural network using only your Value class:
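If you get stuck, the rough shape of a solution might look like the sketch below. It assumes a micrograd-style Value API (`Value(data)`, arithmetic operators, a `.relu()` method, and `.backward()` populating `.grad`), plus a hypothetical module name for the import, so adapt both to whatever you actually built:

```python
import random

# Assumes a micrograd-style Value class with +, *, .relu(), .backward(), .grad
from value import Value  # hypothetical module name; import from wherever your class lives

def neuron(inputs, weights, bias, relu=True):
    out = bias
    for w, x in zip(weights, inputs):
        out = out + w * x
    return out.relu() if relu else out

# 2-layer network: 3 inputs -> 4 hidden units (ReLU) -> 1 output
w1 = [[Value(random.uniform(-1, 1)) for _ in range(3)] for _ in range(4)]
b1 = [Value(0.0) for _ in range(4)]
w2 = [Value(random.uniform(-1, 1)) for _ in range(4)]
b2 = Value(0.0)

x = [Value(1.0), Value(-2.0), Value(0.5)]
hidden = [neuron(x, w1[i], b1[i]) for i in range(4)]
output = neuron(hidden, w2, b2, relu=False)

output.backward()        # populates .grad on every parameter
print(w1[0][0].grad)
```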
Key Takeaway: Automatic differentiation is the engine of deep learning. It’s not magic - it’s just the chain rule applied systematically through a computational graph. Understanding autodiff helps you debug models, implement custom layers, and optimize training.