In the previous modules, you computed gradients by hand. For simple functions like $f(x) = x^2$, that’s easy.

But what about this?

$$f(x) = \sigma\left(\sum_i w_i \cdot \mathrm{ReLU}\left(\sum_j v_{ij} \cdot x_j + b_i\right)\right)$$

That’s a 2-layer neural network. Now imagine 100 layers. Nobody computes these gradients by hand.

Automatic differentiation (autodiff) does it for you, and that’s what powers PyTorch and TensorFlow.
Estimated Time: 3-4 hours
Difficulty: Intermediate
Prerequisites: Chain Rule module
What You’ll Build: Your own mini autodiff system!
Think of computing derivatives as three different approaches to getting driving directions:
Symbolic differentiation is like working out the route with pen and paper, applying road rules algebraically. You get a perfect formula, but it gets impossibly complex for a cross-country trip with thousands of turns.
Numerical differentiation is like driving the route twice with slightly different starting positions and comparing where you end up. Simple but slow (you literally run the function twice per parameter), and small measurement errors accumulate.
Automatic differentiation is like having a co-pilot who records every turn you make and can replay the route backward, noting exactly which turns contributed to going north vs. south. One forward trip plus one backward replay, and you know the derivative with respect to every single starting condition.
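The "co-pilot" idea can be sketched in a few dozen lines. The class below is a hypothetical, minimal `Value` type written for illustration (its name, fields, and the choice of `tanh` are assumptions, not the course's reference implementation): each arithmetic operation records its inputs and a small closure that knows how to push gradients backward.

```python
import math


class Value:
    """A scalar that records how it was computed (hypothetical minimal sketch)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents        # Values this one was computed from
        self._backward = lambda: None  # pushes self.grad to the parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))

        def _backward():
            self.grad += (1.0 - t * t) * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the recorded graph, then replay it in reverse.
        order, seen = [], set()

        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)

        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()


# One forward trip, one backward replay, gradients for every input.
w, x, b = Value(0.5), Value(2.0), Value(-1.0)
y = (w * x + b).tanh()  # forward pass records the graph
y.backward()            # backward pass fills in .grad on w, x, and b
print(w.grad, x.grad, b.grad)
```

Note the `+=` in every `_backward`: a value used in several places receives a contribution from each use, which is the same accumulation behavior PyTorch exposes through `.grad`.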
    # For a function with 1 million parameters
    n_params = 1_000_000

    # Numerical: requires 2 million function evaluations!
    numerical_cost = 2 * n_params

    # Autodiff: just one forward + one backward pass
    autodiff_cost = 2

    print(f"Numerical: {numerical_cost:,} evaluations")
    print(f"Autodiff: {autodiff_cost} passes")
    print(f"Speedup: {numerical_cost / autodiff_cost:,}x")
Neural networks have millions of inputs (weights) and one output (loss) → Reverse mode wins!
    # Forward mode: compute Jacobian column by column
    # For f: R^n → R^m, need n passes to get full Jacobian

    # Reverse mode: compute Jacobian row by row
    # For f: R^n → R^m, need m passes to get full Jacobian

    # Neural network:
    # - n = millions of weights
    # - m = 1 (scalar loss)

    print("Forward mode for neural network:")
    print(" Passes needed: n = millions")
    print("\nReverse mode for neural network:")
    print(" Passes needed: m = 1")
    print(" That's why backprop works!")
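To make the "n passes" cost of forward mode concrete, here is a small sketch using dual numbers. The `Dual` class, the toy `loss` function, and `forward_mode_gradient` are illustrative names invented for this example. Each pass seeds one input with tangent 1.0, so recovering the full gradient of a scalar function of n inputs takes n passes.

```python
class Dual:
    """Number carrying a value and a derivative (tangent) along one input direction."""

    def __init__(self, val, tan=0.0):
        self.val = val
        self.tan = tan

    def __add__(self, other):
        return Dual(self.val + other.val, self.tan + other.tan)

    def __mul__(self, other):
        # Product rule
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)


def loss(ws):
    """Toy scalar function of n inputs: sum of w_i * w_i, plus w_0 * w_1."""
    total = Dual(0.0)
    for w in ws:
        total = total + w * w
    return total + ws[0] * ws[1]


def forward_mode_gradient(f, point):
    """One forward pass per input: seed tangent 1.0 on input i, 0.0 elsewhere."""
    grad, passes = [], 0
    for i in range(len(point)):
        seeded = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(point)]
        grad.append(f(seeded).tan)
        passes += 1
    return grad, passes


grad, passes = forward_mode_gradient(loss, [1.0, 2.0, 3.0])
print(grad, passes)
```

With three inputs this costs three passes; with millions of weights it would cost millions, which is why reverse mode is the practical choice for a scalar loss.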
    import torch

    # Gradients accumulate! Need to zero them between batches.
    x = torch.tensor([2.0], requires_grad=True)

    # First backward
    y = x ** 2
    y.backward()
    print(f"After first backward: grad = {x.grad.item()}")

    # Second backward WITHOUT zeroing
    y = x ** 2
    y.backward()
    print(f"After second backward (accumulated!): grad = {x.grad.item()}")

    # Correct way
    x.grad.zero_()
    y = x ** 2
    y.backward()
    print(f"After zeroing and backward: grad = {x.grad.item()}")
    import torch

    # Sometimes you don't want gradients to flow through
    x = torch.tensor([2.0], requires_grad=True)
    y = x ** 2

    # Detach y from the graph: z treats y as a constant
    z = y.detach() * x  # Gradient won't flow through y
    z.backward()
    print(f"x.grad = {x.grad.item()}")  # Only from the direct x, not x^2
    import torch

    # In-place operations can break autodiff
    x = torch.tensor([2.0], requires_grad=True)
    y = x ** 2

    # This can cause problems when backward needs the value being overwritten:
    # y.add_(1)  # In-place addition

    # Do this instead:
    y = y + 1  # Creates new tensor
Getting Gradient Checking Right

When verifying your custom autodiff against numerical gradients, the choice of h matters enormously. Too large and the finite difference is inaccurate; too small and floating-point cancellation ruins the result.

A common approach, in the same spirit as PyTorch’s gradcheck (centered differences compared against the analytical gradient within a tolerance):
    import numpy as np

    def gradient_check(f, x, analytical_grad, h=1e-6):
        """Verify analytical gradient against numerical gradient (scalar x)."""
        # Centered difference: evaluate on both sides of x
        numerical_grad = (f(x + h) - f(x - h)) / (2 * h)

        # Use RELATIVE error, not absolute
        # This handles parameters of different magnitudes
        numerator = np.abs(analytical_grad - numerical_grad)
        denominator = max(np.abs(analytical_grad), np.abs(numerical_grad), 1e-8)
        relative_error = numerator / denominator

        if relative_error < 1e-5:
            return "PASS"
        elif relative_error < 1e-3:
            return "WARNING -- check for subtle bugs"
        else:
            return f"FAIL -- relative error {relative_error:.2e}"
Key points:
Always use centered differences (both +h and -h), not forward differences
Check relative error, not absolute error, because gradients can span many orders of magnitude
Temporarily disable dropout and batch normalization during gradient checking: dropout introduces randomness that defeats the numerical comparison, and batch normalization makes each output depend on the rest of the batch
If gradients pass the check with float64 but fail with float32, your code is most likely correct but numerically sensitive (a step size like 1e-6 is also too small for reliable float32 finite differences). Consider using Kahan summation or reorganizing computations to reduce cancellation
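The trade-off in the choice of h, and the advantage of centered differences, can be seen by sweeping h on a function whose derivative is known. The sketch below uses sin(x) at x = 1.0 as an arbitrary test case; the helper names are illustrative.

```python
import math


def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h


def centered_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)


x0 = 1.0
exact = math.cos(x0)  # d/dx sin(x) = cos(x)

errors = {}
for h in (1e-1, 1e-3, 1e-5, 1e-7, 1e-9, 1e-11, 1e-13):
    fwd_err = abs(forward_diff(math.sin, x0, h) - exact)
    ctr_err = abs(centered_diff(math.sin, x0, h) - exact)
    errors[h] = (fwd_err, ctr_err)
    print(f"h={h:.0e}  forward error={fwd_err:.2e}  centered error={ctr_err:.2e}")
```

Running it shows the error shrinking as h decreases, then growing again once floating-point cancellation dominates. For moderate h the centered difference is far more accurate than the forward difference at the same step size.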
Problem: Implement softmax for a vector and compute its gradient.

The softmax function is: $\mathrm{softmax}(x_i) = e^{x_i} / \sum_j e^{x_j}$

The Jacobian is: $\partial s_i / \partial x_j = s_i(\delta_{ij} - s_j)$
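For reference, the Jacobian above follows from the quotient rule. In the short derivation below, $s_i$ denotes the i-th softmax output and $Z$ is shorthand introduced here for the normalizing sum.

```latex
\begin{aligned}
s_i &= \frac{e^{x_i}}{Z}, \qquad Z = \sum_k e^{x_k} \\
\frac{\partial s_i}{\partial x_j}
  &= \frac{\delta_{ij}\, e^{x_i}\, Z - e^{x_i}\, e^{x_j}}{Z^2} \\
  &= \frac{e^{x_i}}{Z}\left(\delta_{ij} - \frac{e^{x_j}}{Z}\right)
   = s_i\,(\delta_{ij} - s_j)
\end{aligned}
```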
Exercise 3: Implement from Scratch
Problem: Build a complete 2-layer neural network using only your Value class:
Key Takeaway: Automatic differentiation is the engine of deep learning. It’s not magic - it’s just the chain rule applied systematically through a computational graph. Understanding autodiff helps you debug models, implement custom layers, and optimize training.
Explain the difference between symbolic differentiation, numerical differentiation, and automatic differentiation. Why did deep learning frameworks choose autodiff over the other two?
Strong Answer:
Symbolic differentiation applies algebraic rules to produce an exact derivative formula. For f(x) = x^2 * sin(x), it gives f’(x) = 2x * sin(x) + x^2 * cos(x). The problem is expression swell: for complex compositions, the symbolic derivative can be exponentially larger than the original expression. A 10-layer neural network’s loss function, symbolically differentiated, would produce an unmanageably large expression that is slow to evaluate even if you could derive it.
Numerical differentiation uses finite differences: (f(x+h) - f(x-h))/(2h). It is simple and works for any function you can evaluate. But it has two fatal flaws for deep learning: it requires O(n) function evaluations for n parameters (two per parameter with centered differences), and it suffers from the truncation-cancellation trade-off, where a large h gives truncation error and a small h gives floating-point cancellation, so no choice of h is both accurate and stable.
Automatic differentiation computes exact derivatives (to machine precision) at a cost proportional to the original function evaluation. It works by decomposing the computation into elementary operations and applying the chain rule through them. Reverse-mode autodiff (backpropagation) computes the gradient of a scalar output with respect to ALL inputs in a single backward pass, regardless of the number of inputs.
Deep learning chose autodiff because it is the only method that scales. With millions of parameters and a scalar loss, reverse-mode autodiff gives exact gradients in O(1) backward passes. Symbolic differentiation would produce an expression too large to store. Numerical differentiation would require millions of forward passes per gradient step. The cost ratio is not 2x or 10x — it is millions-to-one. This is literally what makes deep learning computationally feasible.
Follow-up: JAX offers both forward-mode and reverse-mode autodiff. Can you describe a scenario where you would compose the two modes?

The classic use case is computing Hessian-vector products efficiently. The Hessian of a scalar function is the matrix of second derivatives. Computing the full Hessian is O(n^2) in space, which is prohibitive for large models. But a Hessian-vector product Hv can be computed with O(n) memory, at a small constant multiple of the cost of one gradient evaluation, by applying forward mode to the reverse-mode gradient function (forward-over-reverse). First, you use reverse-mode to define the gradient function g(theta). Then, you use forward-mode to compute the Jacobian-vector product of g with respect to theta in direction v, which gives Hv. In JAX, this is jax.jvp(jax.grad(loss), (params,), (v,)), which returns the pair (g(params), Hv). This is used in conjugate gradient methods for second-order optimization, in the TRPO algorithm for reinforcement learning (which needs Hv for the Fisher-vector product), and in estimating the spectral norm of the Hessian for loss landscape analysis.
PyTorch uses dynamic computational graphs while TensorFlow 1.x used static graphs. What are the trade-offs, and why did the industry converge toward dynamic graphs?
Strong Answer:
In a static graph system (TensorFlow 1.x), you first define the entire computation as a graph object, then execute it in a separate “session.” The graph is compiled once and reused. This enables aggressive ahead-of-time optimizations: operation fusion, memory planning, dead code elimination, and cross-device placement. The downside is that Python control flow (if/else, loops) cannot be used naturally — you need special graph operations like tf.cond and tf.while_loop.
In a dynamic graph system (PyTorch, JAX in eager mode), the graph is built on-the-fly as Python code executes. Each line of Python immediately computes a value and appends a node to the graph. This means standard Python debugging tools (print statements, pdb, breakpoints) work naturally. Control flow is just regular Python.
The industry converged on dynamic graphs for a simple reason: researcher productivity. In ML research, you spend most of your time writing and debugging new model architectures. The ability to set a breakpoint, inspect intermediate tensors, and step through the computation in a standard debugger dramatically accelerates the research cycle. The performance overhead of dynamic graphs (typically 10-20% slower than optimized static graphs) is acceptable because researcher time is more expensive than GPU time.
The convergence is not absolute. TensorFlow 2.0 adopted eager execution by default but added tf.function for compiling hot paths into static graphs. PyTorch added torch.compile (PyTorch 2.0) which traces the dynamic graph and compiles it for performance. JAX takes a hybrid approach: eager by default, with jax.jit to compile functions. The modern consensus is: develop dynamically, deploy with compilation.
Follow-up: When you use torch.compile or jax.jit, what happens to Python control flow that depends on tensor values?

This is where tracing-based compilation gets tricky. A tracer runs your function once and records the operations it sees. If your code has a branch like “if x.sum() > 0: do_A() else: do_B()”, a tracer that runs on concrete example inputs (torch.jit.trace, for instance) records only whichever branch those inputs take. If a future input takes the other branch, the compiled version is silently wrong. PyTorch’s torch.compile handles this with “graph breaks”: it detects data-dependent control flow and splits the compiled region, falling back to Python interpretation for the branch, then resuming compilation after. JAX’s jax.jit traces with abstract values instead, so a Python if on a traced array raises an error at trace time; you must explicitly use jax.lax.cond for data-dependent branches, making the control flow visible to the compiler. In practice, most neural network forward passes have minimal data-dependent control flow, so compilation works well for the bulk of the computation.
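The failure mode of tracing on concrete inputs can be reproduced with a toy tracer in plain Python. Everything below (`Traced`, `model`, `trace`) is a hypothetical stand-in written for illustration and is not how torch.compile or jax.jit are implemented. The tracer records arithmetic operations, but the Python `if` runs outside its view, so only the branch taken by the example input ends up in the recorded program.

```python
class Traced:
    """Toy stand-in for a traced tensor: holds a concrete value and logs operations."""

    def __init__(self, value, log):
        self.value = value
        self.log = log

    def __mul__(self, k):
        self.log.append(("mul", k))
        return Traced(self.value * k, self.log)

    def __add__(self, k):
        self.log.append(("add", k))
        return Traced(self.value + k, self.log)


def model(x):
    # Data-dependent Python branch: invisible to the tracer.
    if x.value > 0:
        return x * 2
    return x + 100


def trace(fn, example_input):
    """Run fn once on an example and return a replay function built from the log."""
    log = []
    fn(Traced(example_input, log))

    def compiled(value):
        for op, k in log:
            value = value * k if op == "mul" else value + k
        return value

    return compiled


compiled = trace(model, 3.0)               # traced with a positive example
eager_out = model(Traced(-1.0, [])).value  # eager run takes the "else" path
compiled_out = compiled(-1.0)              # replay still multiplies by 2
print(eager_out, compiled_out)
```

The eager and replayed results disagree for the negative input, because the recorded program only ever contained the multiplication. Making the branch an explicit operation the tracer can see, which is the role jax.lax.cond plays in JAX, removes the mismatch.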
You need to implement a custom backward pass for a non-standard operation in PyTorch. Walk me through how torch.autograd.Function works and what pitfalls to watch for.
Strong Answer:
You subclass torch.autograd.Function and implement two static methods: forward() and backward(). forward() receives input tensors and a context object, computes the output, and saves any tensors needed for the backward pass using ctx.save_for_backward(). backward() receives the upstream gradient (grad_output) and the context, and must return one gradient per forward input.
The contract is: backward must return tensors with the same shape as the corresponding forward inputs. If an input does not need a gradient (like an integer parameter), return None for that position. Getting the number and order of returned gradients wrong is one of the most common bugs.
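Pitfall one: saving too much or too little in the context. Local variables from forward() are not visible in backward(), so every tensor the gradient formula needs must be passed through ctx.save_for_backward(). Stashing tensors as plain attributes on ctx instead bypasses autograd’s bookkeeping and can lead to wrong gradients or memory leaks. If you save too many large tensors, you waste memory.
Pitfall two: in-place modification of saved tensors. If you save tensor A in the forward pass and then modify A in-place before backward runs, the saved reference points to the modified data, not the original. For tensors saved with ctx.save_for_backward(), autograd detects this and raises a RuntimeError during backward; for tensors stored any other way, the gradient is silently wrong. Save a copy if there is any chance of in-place modification.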
Pitfall three: not handling batched inputs. Your backward must work for arbitrary batch sizes. A common bug is implementing the gradient for a single sample and then getting shape errors with batches.
Always validate with torch.autograd.gradcheck, which compares your analytical backward against numerical finite differences. Run this with float64 tensors for maximum precision.
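A minimal sketch of that workflow, assuming PyTorch is installed. `ScaledCube` is a made-up operation chosen only because its gradient is easy to verify by hand; it takes one tensor input and one non-tensor input, so backward returns a gradient for the first and None for the second.

```python
import torch


class ScaledCube(torch.autograd.Function):
    """Hypothetical custom op: y = scale * x^3, with a hand-written backward."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.save_for_backward(x)  # tensors needed in backward go here
        ctx.scale = scale         # non-tensor values can be stored on ctx
        return scale * x ** 3

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        grad_x = grad_output * ctx.scale * 3 * x ** 2
        # One return value per forward input: x gets a gradient, scale gets None.
        return grad_x, None


# Hand check on a tiny input: d/dx (2 * x^3) = 6 * x^2
x = torch.tensor([1.0, 2.0], dtype=torch.float64, requires_grad=True)
ScaledCube.apply(x, 2.0).sum().backward()
print(x.grad)

# Numerical check on a batched float64 input
torch.manual_seed(0)
batch = torch.randn(4, 3, dtype=torch.float64, requires_grad=True)
ok = torch.autograd.gradcheck(lambda t: ScaledCube.apply(t, 2.0), (batch,))
print(ok)
```

Because the backward is written elementwise, the same code handles the 1-D hand check and the batched gradcheck input without shape-specific logic.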
Follow-up: When would you need a custom backward pass instead of relying on PyTorch’s built-in autograd?

Three main scenarios. First, when the autodiff-generated backward is numerically unstable but a manually-derived formula is stable. The log-determinant of a matrix is a classic example. Second, when the forward pass involves non-differentiable operations that you want to approximate with a straight-through estimator. For example, quantization (rounding to discrete values) has zero derivative almost everywhere, but you can define a custom backward that passes the gradient through unchanged. This is essential for training quantized neural networks. Third, memory optimization: sometimes the autodiff graph stores large intermediate tensors that you can avoid by implementing a mathematically equivalent backward that recomputes them from smaller saved quantities.
What is the 'gradient accumulation' bug in PyTorch, and why does the framework not zero gradients automatically?
Strong Answer:
In PyTorch, calling loss.backward() ADDS the computed gradients to the .grad attribute of each parameter rather than replacing them. If you forget to call optimizer.zero_grad() before computing the next batch’s gradients, the gradients from the previous batch are still there, and the new batch’s gradients are added on top. The result is that your effective gradient is the sum of all batches since the last zero_grad(), which is mathematically wrong for standard SGD.
This is one of the most common PyTorch bugs for beginners. The symptom is that the model appears to train but converges to a worse solution or oscillates erratically. It is insidious because the code does not crash; it just silently computes wrong updates.
Why PyTorch does not zero automatically: gradient accumulation is a deliberate feature, not a bug. It enables training with effective batch sizes larger than what fits in GPU memory. If your GPU can hold batch size 8 but you want the gradient quality of batch size 32, you run 4 forward-backward passes (each with batch 8) without zeroing gradients, then call optimizer.step(). With the loss scaled by the number of accumulation steps, the accumulated gradient matches the gradient computed over all 32 samples at once.
The accumulation semantics also enable multi-task and multi-loss training. If you have two losses (classification loss + reconstruction loss), you can call loss1.backward() and loss2.backward() separately, and the gradients accumulate correctly.
Best practice: always call optimizer.zero_grad() at the start of each training iteration (before loss.backward()), not after optimizer.step().
Follow-up: If you are using gradient accumulation to simulate a larger batch size, do you need to scale the loss or gradients?

Yes, you need to scale. If you accumulate over K micro-batches, each backward pass adds gradients as if the loss were computed over one micro-batch. The accumulated gradient is K times larger than the gradient from a single forward-backward pass on the full batch (because a mean-reduced loss averages over samples within a batch, while PyTorch sums across backward calls). You should divide the loss by K before calling backward (so each backward contributes 1/K of the gradient), or divide the gradient by K after accumulation before optimizer.step(). If you forget to scale, your effective learning rate is K times larger than intended, which can cause instability. The loss division approach is cleaner and more common.
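The factor of K can be checked with plain arithmetic, without any framework. The sketch below assumes a one-parameter model with a mean-squared-error loss and equal-sized micro-batches; the data and the helper `mean_loss_grad` are made up for illustration. Summing per-micro-batch gradients mimics what repeated backward calls do to `.grad`.

```python
def mean_loss_grad(w, xs, ys):
    """Gradient w.r.t. w of mean((w*x - y)^2) over the given samples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
K = 4                # number of micro-batches
size = len(xs) // K  # samples per micro-batch (equal sizes assumed)

full_grad = mean_loss_grad(w, xs, ys)  # gradient on the full batch

unscaled = 0.0  # mimics .grad after K backward calls on the raw loss
scaled = 0.0    # mimics .grad after K backward calls on loss / K
for k in range(K):
    mx = xs[k * size:(k + 1) * size]
    my = ys[k * size:(k + 1) * size]
    g = mean_loss_grad(w, mx, my)
    unscaled += g
    scaled += g / K

print(full_grad, unscaled, scaled)
```

The unscaled sum comes out K times the full-batch gradient, while the scaled sum matches it. With unequal micro-batch sizes, dividing by K is only approximately right, and weighting each micro-batch by its share of the samples would be needed for an exact match.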