Gradients & Multivariable Calculus

Before You Begin: Make sure you’re comfortable with derivatives from the previous module. Gradients are just “derivatives, but more of them.” If you’re shaky on the basics, review Module 1 first!

Why This Module Matters for ML

| ML Concept | How Gradients Are Used |
| --- | --- |
| Training neural networks | Gradient descent adjusts millions of weights simultaneously |
| Backpropagation | Computing gradients through layers of functions |
| Optimization | Finding the minimum of a loss function with many parameters |
| Learning rate | Scales the gradient to control step size |
The Big Picture: Neural networks have millions of parameters. Gradients tell us how to adjust ALL of them at once, in the right proportions.

Your Challenge: The CEO’s Dilemma

In the previous module, you optimized one thing (price). But in the real world, you rarely control just one variable. Imagine you’re the CEO of a tech startup. You have two powerful levers to pull:
  1. Price ($x$): How much you charge
  2. Ad Spend ($y$): How much you spend on marketing
Your Goal: Maximize Profit. The problem is, these variables interact!
  • High price + Low ads = No sales
  • Low price + High ads = Lots of sales, but high costs
  • High price + High ads = Premium brand? Or wasted money?
You are standing on a complex “Profit Landscape” with hills and valleys. You want to find the highest peak (maximum profit). The Catch: You’re blindfolded (or in a thick fog). You can’t see the peak. You can only feel the slope under your feet.

The Hiker in the Fog

[Figure: Hiker in the Fog - Gradient Intuition]
This is the classic intuition for Gradients. Imagine you’re hiking up a mountain in dense fog:
  1. You can’t see the summit.
  2. You want to go up as fast as possible.
  3. What do you do?
You feel the ground with your foot:
  • Step East ($x$): Is it going up or down? (Partial derivative w.r.t. $x$)
  • Step North ($y$): Is it going up or down? (Partial derivative w.r.t. $y$)
If East is steep uphill, and North is slightly uphill, you move mostly East, slightly North. The Gradient is your Compass. It combines these two slopes into ONE arrow that points steepest uphill.

What Is a Gradient?

Intuitive Definition

Gradient = The Direction of Steepest Ascent. It answers: “Which combination of changes ($x$ and $y$) will increase my output the fastest?”

Mathematical Definition

The gradient (symbol $\nabla$, pronounced “del” or “nabla”) is just a vector holding all the partial derivatives:
$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix}$$
  • Top number: Slope in the $x$ direction (Price), “if I change only price, how does profit change?”
  • Bottom number: Slope in the $y$ direction (Ad Spend), “if I change only ads, how does profit change?”
Breaking Down Partial Derivatives

The Knob Panel Analogy

Here is a concrete analogy that makes gradients click for most people. Picture a recording studio mixing board with hundreds of knobs: volume for each instrument, reverb, bass, treble, and so on. Each knob controls one aspect of the final sound.
A partial derivative is what happens when you twist one knob while holding all others still and listen to how the overall quality changes.
The gradient is the full set of instructions: “Turn volume up a little, bass down a lot, reverb up slightly…” Each number in the gradient vector tells you how much to twist one specific knob and in which direction. The vector as a whole tells you the single best combination of simultaneous adjustments to improve quality as fast as possible.
In ML, each weight in your model is one knob. A model with 175 billion parameters (GPT-3 scale) has 175 billion knobs, and the gradient is a vector with 175 billion entries telling you how to adjust every single one of them in the next training step.

The Key Insight: One Variable at a Time

A partial derivative is just a regular derivative, but you pretend all other variables are constants.
| Derivative Type | What It Means | Example |
| --- | --- | --- |
| $\frac{df}{dx}$ | Regular derivative (1 variable) | $f(x) = x^2 \Rightarrow \frac{df}{dx} = 2x$ |
| $\frac{\partial f}{\partial x}$ | Partial derivative (treat $y$ as constant) | $f(x,y) = x^2 + 3y \Rightarrow \frac{\partial f}{\partial x} = 2x$ |
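
One way to convince yourself that "treat the other variable as a constant" really works is to nudge one input while freezing the other and measure the change. The sketch below does this with a central difference for the table's example $f(x,y) = x^2 + 3y$. The helper names `partial_x` and `partial_y`, the step `h`, and the test point are assumptions chosen for illustration.

```python
def f(x, y):
    """Example from the table: f(x, y) = x^2 + 3y."""
    return x**2 + 3*y

def partial_x(func, x, y, h=1e-6):
    """Central-difference estimate of the partial derivative w.r.t. x (y held fixed)."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)

def partial_y(func, x, y, h=1e-6):
    """Central-difference estimate of the partial derivative w.r.t. y (x held fixed)."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)

x0, y0 = 2.0, 5.0  # hypothetical test point
print(f"df/dx ≈ {partial_x(f, x0, y0):.4f}  (analytic: 2*x0 = {2*x0})")
print(f"df/dy ≈ {partial_y(f, x0, y0):.4f}  (analytic: 3)")
```

The numerical estimates should agree with the analytic answers up to small floating-point error, which makes this a handy sanity check for any partial derivative you compute by hand.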

Let’s Solve Your CEO Problem

Suppose your Profit function is: $P(x, y) = -x^2 - y^2 + 10x + 8y$
Let’s find the gradient at your current position: Price = 2, Ad Spend = 3.
import numpy as np
import matplotlib.pyplot as plt

def profit(x, y):
    """The profit function - a "mountain" with a peak."""
    return -x**2 - y**2 + 10*x + 8*y

def profit_gradient(x, y):
    """
    Compute the gradient of the profit function.
    
    Step 1: Find ∂P/∂x (treat y as constant)
    P = -x² - y² + 10x + 8y
    ∂P/∂x = -2x + 0 + 10 + 0 = -2x + 10
    
    Step 2: Find ∂P/∂y (treat x as constant)
    ∂P/∂y = 0 - 2y + 0 + 8 = -2y + 8
    """
    dp_dx = -2*x + 10  # How profit changes with price
    dp_dy = -2*y + 8   # How profit changes with ads
    
    return np.array([dp_dx, dp_dy])

# Your current strategy
current_price = 2
current_ad_spend = 3

grad = profit_gradient(current_price, current_ad_spend)
current_profit = profit(current_price, current_ad_spend)

print(f"=== CEO DASHBOARD ===")
print(f"Current Position: Price=${current_price}, Ads=${current_ad_spend}")
print(f"Current Profit: ${current_profit}")
print(f"")
print(f"Gradient Vector: {grad}")
Output:
=== CEO DASHBOARD ===
Current Position: Price=$2, Ads=$3
Current Profit: $31

Gradient Vector: [6 2]
What this tells you:
  • 6 (x-component): Increasing Price is VERY profitable right now.
  • 2 (y-component): Increasing Ad Spend is MILDLY profitable.
  • Decision: You should increase BOTH, but focus 3x more effort on raising Price!
Key Insight: The gradient doesn’t just tell you “up” - it tells you the exact mix of changes to make.
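
To see the "exact mix" idea in action, here is a minimal sketch that repeatedly steps in the gradient direction starting from Price = 2, Ads = 3. The step size `0.1` and the 50 iterations are assumed values picked for illustration, not part of the original problem.

```python
import numpy as np

def profit(x, y):
    return -x**2 - y**2 + 10*x + 8*y

def profit_gradient(x, y):
    return np.array([-2*x + 10, -2*y + 8])

position = np.array([2.0, 3.0])  # start at Price=2, Ads=3
step_size = 0.1                  # assumed step size for this sketch

for step in range(50):
    # Gradient ASCENT: move WITH the gradient because we want more profit
    position = position + step_size * profit_gradient(*position)

print(f"Final position: Price={position[0]:.3f}, Ads={position[1]:.3f}")
print(f"Final profit: {profit(*position):.3f}")
```

The walk settles near Price = 5, Ads = 4, which is exactly where both partial derivatives (-2x + 10 and -2y + 8) are zero.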

Partial Derivatives: Step-by-Step Guide

Before diving into examples, let’s master the technique of computing partial derivatives.

The Key Rule

To find $\frac{\partial f}{\partial x}$: Treat ALL other variables as constants, then differentiate with respect to $x$.

Example 1: Basic Polynomial

$f(x, y) = 3x^2 + 2xy + y^3$
Find $\frac{\partial f}{\partial x}$ (treat $y$ as constant): $\frac{\partial f}{\partial x} = 6x + 2y + 0 = 6x + 2y$
Find $\frac{\partial f}{\partial y}$ (treat $x$ as constant): $\frac{\partial f}{\partial y} = 0 + 2x + 3y^2 = 2x + 3y^2$
The Gradient: $$\nabla f = \begin{bmatrix}6x + 2y \\ 2x + 3y^2\end{bmatrix}$$
def f(x, y):
    return 3*x**2 + 2*x*y + y**3

def gradient_f(x, y):
    df_dx = 6*x + 2*y
    df_dy = 2*x + 3*y**2
    return np.array([df_dx, df_dy])

# At point (1, 2)
print(gradient_f(1, 2))  # [10, 14]

Example 2: Mixed Terms

$g(x, y) = x^2 y^3 + e^{xy}$
Find $\frac{\partial g}{\partial x}$:
  • For $x^2 y^3$: treat $y^3$ as constant → $2xy^3$
  • For $e^{xy}$: chain rule with $u = xy$ → $e^{xy} \cdot y$
$\frac{\partial g}{\partial x} = 2xy^3 + ye^{xy}$
Find $\frac{\partial g}{\partial y}$:
  • For $x^2 y^3$: treat $x^2$ as constant → $3x^2y^2$
  • For $e^{xy}$: chain rule with $u = xy$ → $e^{xy} \cdot x$
$\frac{\partial g}{\partial y} = 3x^2y^2 + xe^{xy}$
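
Chain-rule terms are easy to get wrong, so a common habit is a "gradient check": compare the hand-derived gradient with a numerical one. The sketch below does this for $g(x,y) = x^2y^3 + e^{xy}$. The helper `numerical_gradient` and the test point are assumptions made for this illustration.

```python
import numpy as np

def g(p):
    x, y = p
    return x**2 * y**3 + np.exp(x * y)

def grad_g(p):
    """Hand-derived gradient from the example above."""
    x, y = p
    return np.array([2*x*y**3 + y*np.exp(x*y),
                     3*x**2*y**2 + x*np.exp(x*y)])

def numerical_gradient(func, p, h=1e-6):
    """Central differences, one coordinate at a time."""
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for k in range(p.size):
        step = np.zeros_like(p)
        step[k] = h                      # nudge only coordinate k
        grad[k] = (func(p + step) - func(p - step)) / (2 * h)
    return grad

point = np.array([0.5, 1.2])  # hypothetical test point
analytic = grad_g(point)
numeric = numerical_gradient(g, point)
print("analytic:", analytic)
print("numeric: ", numeric)
print("match:", np.allclose(analytic, numeric, rtol=1e-5))
```

If the two vectors disagree beyond rounding error, the hand derivation (or its code) most likely has a mistake.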

Example 3: Common ML Functions

Mean Squared Error: $$L(w, b) = \frac{1}{n}\sum_{i=1}^{n}(wx_i + b - y_i)^2$$
Find $\frac{\partial L}{\partial w}$: $$\frac{\partial L}{\partial w} = \frac{2}{n}\sum_{i=1}^{n}(wx_i + b - y_i) \cdot x_i$$
Find $\frac{\partial L}{\partial b}$: $$\frac{\partial L}{\partial b} = \frac{2}{n}\sum_{i=1}^{n}(wx_i + b - y_i)$$
def mse_gradients(w, b, x, y):
    """Compute gradients of MSE loss"""
    n = len(x)
    predictions = w * x + b
    errors = predictions - y
    
    dL_dw = (2/n) * np.sum(errors * x)
    dL_db = (2/n) * np.sum(errors)
    
    return np.array([dL_dw, dL_db])

# Example: fitting a line
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
w, b = 0.5, 1.0

grad = mse_gradients(w, b, x, y)
print(f"∂L/∂w = {grad[0]:.3f}")
print(f"∂L/∂b = {grad[1]:.3f}")
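
The MSE gradient becomes useful once you repeatedly step against it. Below is a minimal sketch that fits the same five data points with plain gradient descent and compares the result with NumPy's least-squares fit (`np.polyfit`). The learning rate `0.05` and the 2000 iterations are assumptions chosen so this small example converges.

```python
import numpy as np

def mse_gradients(w, b, x, y):
    """Gradient of the MSE loss with respect to (w, b)."""
    n = len(x)
    errors = (w * x + b) - y
    return np.array([(2/n) * np.sum(errors * x), (2/n) * np.sum(errors)])

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

params = np.array([0.5, 1.0])  # same starting (w, b) as the example
learning_rate = 0.05           # assumed value for this sketch

for _ in range(2000):
    # Step OPPOSITE to the gradient to reduce the loss
    params = params - learning_rate * mse_gradients(params[0], params[1], x, y)

w_fit, b_fit = params
w_ref, b_ref = np.polyfit(x, y, deg=1)  # closed-form least-squares line

print(f"Gradient descent: w={w_fit:.4f}, b={b_fit:.4f}")
print(f"np.polyfit:       w={w_ref:.4f}, b={b_ref:.4f}")
```

Both approaches should land on the same line, since the MSE loss for a linear model has a single minimum.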

Partial Derivative Rules Summary

| Function Type | $\frac{\partial}{\partial x}$ | Example |
| --- | --- | --- |
| $x^n$ | $nx^{n-1}$ | $\frac{\partial}{\partial x}x^3 = 3x^2$ |
| $y$ (constant) | $0$ | $\frac{\partial}{\partial x}y^2 = 0$ |
| $xy$ | $y$ | $\frac{\partial}{\partial x}xy = y$ |
| $x^m y^n$ | $mx^{m-1}y^n$ | $\frac{\partial}{\partial x}x^2y^3 = 2xy^3$ |
| $e^{xy}$ | $ye^{xy}$ | Chain rule: $(e^u)' \cdot \frac{\partial u}{\partial x}$ |
| $\ln(xy)$ | $\frac{1}{x}$ | $\frac{1}{xy} \cdot y = \frac{1}{x}$ |
| $\sin(xy)$ | $y\cos(xy)$ | Chain rule |

Example 1: Optimizing Your Business

The Problem

Let’s formalize your CEO problem. You want to maximize Revenue based on two investments:
  • $x$ = Advertising Budget ($1000s)
  • $y$ = Product Quality Investment ($1000s)
Your Revenue Function: $R(x, y) = 100x + 80y - x^2 - y^2 - 0.5xy$
Your Goal: Find the perfect budget allocation $(x, y)$ that maximizes $R$.

Visualizing Your Landscape

Here is what your revenue landscape looks like. The gradient (red arrow) shows you the fastest way to the top.
[Figure: Steepest Ascent Visual]

Step 1: Compute the Gradient

You need to find the partial derivatives (the slope in each direction):
  1. Slope w.r.t. Ad Budget ($x$): $\frac{\partial R}{\partial x} = 100 - 2x - 0.5y$ (treat $y$ as a constant)
  2. Slope w.r.t. Quality ($y$): $\frac{\partial R}{\partial y} = 80 - 2y - 0.5x$ (treat $x$ as a constant)
Your Gradient Vector: $$\nabla R = \begin{bmatrix} 100 - 2x - 0.5y \\ 80 - 2y - 0.5x \end{bmatrix}$$

Step 2: Check Your Current Strategy

Suppose you are currently spending:
  • $x = 20$ ($20k on Ads)
  • $y = 15$ ($15k on Quality)
Let’s plug these into your gradient:
import numpy as np

def revenue_gradient(x, y):
    dR_dx = 100 - 2*x - 0.5*y
    dR_dy = 80 - 2*y - 0.5*x
    return np.array([dR_dx, dR_dy])

# Your current allocation
x, y = 20, 15
grad = revenue_gradient(x, y)

print(f"Current allocation: Ad=${x}k, Quality=${y}k")
print(f"Gradient: {grad}")
Output:
Current allocation: Ad=$20k, Quality=$15k
Gradient: [52.5, 40.0]
Interpretation:
  • 52.5: Increasing Ad spend is HIGHLY profitable.
  • 40.0: Increasing Quality is ALSO profitable, but slightly less so.
  • Action: Increase both, but prioritize Ads slightly more.

Step 3: Find the Optimal Allocation

To find the absolute peak, you want the point where the slope is ZERO in all directions (flat top). Set the gradient to 0:
$$\begin{cases} 100 - 2x - 0.5y = 0 \\ 80 - 2y - 0.5x = 0 \end{cases}$$
Solving this system (using linear algebra or substitution):
# Solve system of equations
# 2x + 0.5y = 100
# 0.5x + 2y = 80

A = np.array([[2, 0.5], [0.5, 2]])
b = np.array([100, 80])
optimal = np.linalg.solve(A, b)

x_opt, y_opt = optimal
print(f"Optimal Ad Budget: ${x_opt:.2f}k")
print(f"Optimal Quality Budget: ${y_opt:.2f}k")
Output:
Optimal Ad Budget: $42.67k
Optimal Quality Budget: $29.33k
Result: You found the perfect strategy! Spend about $42.67k on ads and $29.33k on quality to maximize revenue.
Real Application: The same idea underlies ad-auction optimization at companies like Google, where multiple metrics (CTR, bid price, user relevance) have to be balanced simultaneously.
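
A zero gradient only says the ground is flat, not that you are on a peak. As a hedged sanity check, the sketch below re-solves the system, confirms the gradient vanishes there, and inspects the matrix of second derivatives of $R$ (the Hessian, entries read directly off the revenue function) to confirm the flat spot is a maximum.

```python
import numpy as np

def revenue_gradient(x, y):
    return np.array([100 - 2*x - 0.5*y, 80 - 2*y - 0.5*x])

# Same linear system as above: gradient = 0
A = np.array([[2, 0.5], [0.5, 2]])
b = np.array([100, 80])
x_opt, y_opt = np.linalg.solve(A, b)

grad_at_opt = revenue_gradient(x_opt, y_opt)

# Second derivatives of R: d²R/dx² = -2, d²R/dy² = -2, d²R/dxdy = -0.5
hessian = np.array([[-2.0, -0.5],
                    [-0.5, -2.0]])
eigenvalues = np.linalg.eigvalsh(hessian)  # eigvalsh: for symmetric matrices

print(f"Gradient at optimum: {grad_at_opt}")
print(f"Hessian eigenvalues: {eigenvalues}")
print("Maximum confirmed:", bool(np.all(eigenvalues < 0)))
```

Both eigenvalues are negative, so the surface curves downward in every direction around the solution: it is a true peak rather than a saddle.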

Example 2: Optimizing Your Grades

The Problem

You want to maximize your overall GPA across 3 subjects:
  • $x$ = hours/week on Math
  • $y$ = hours/week on English
  • $z$ = hours/week on Science
Your GPA Function: $G(x, y, z) = \sqrt{x} + \sqrt{y} + \sqrt{z} - 0.01(x^2 + y^2 + z^2)$
(Square roots represent learning; squared terms represent burnout/fatigue.)
Constraint: You only have 30 hours/week total.

Computing Your Gradient

The gradient tells you: “If I add 1 hour of study, which subject gives the biggest GPA boost?”
def gpa_gradient(x, y, z):
    # Partial derivatives (marginal benefit - marginal cost)
    dG_dx = 0.5/np.sqrt(x) - 0.02*x
    dG_dy = 0.5/np.sqrt(y) - 0.02*y
    dG_dz = 0.5/np.sqrt(z) - 0.02*z
    return np.array([dG_dx, dG_dy, dG_dz])

# Your current schedule
x, y, z = 10, 12, 8  # hours per subject

grad = gpa_gradient(x, y, z)

print(f"Current Schedule: Math={x}h, English={y}h, Science={z}h")
print(f"Gradient: {grad}")
Output:
Current Schedule: Math=10h, English=12h, Science=8h
Gradient: [-0.042, -0.096, 0.017]
Interpretation:
  • Math (-0.042): Negative! Studying MORE math will actually LOWER your GPA (burnout).
  • English (-0.096): Very Negative! You are over-studying English.
  • Science (+0.017): Positive! You should shift time to Science.
Action: Study less English/Math, study more Science!
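
Because total study time is fixed at 30 hours, you cannot simply follow the raw gradient: that would change the total. One simple way to respect the budget, sketched below under the assumption that hours stay positive, is to subtract the average gradient component so that every step only moves hours between subjects. The step size `5.0` and the 200 iterations are illustrative choices.

```python
import numpy as np

def gpa_gradient(hours):
    """Marginal benefit minus marginal cost for each subject."""
    return 0.5 / np.sqrt(hours) - 0.02 * hours

hours = np.array([10.0, 12.0, 8.0])  # Math, English, Science
step_size = 5.0                      # assumed value for this sketch

for _ in range(200):
    grad = gpa_gradient(hours)
    # Remove the part of the gradient that would change total hours,
    # leaving only "shift time from one subject to another"
    shift = grad - grad.mean()
    hours = hours + step_size * shift

print(f"Schedule: Math={hours[0]:.2f}h, English={hours[1]:.2f}h, Science={hours[2]:.2f}h")
print(f"Total hours: {hours.sum():.2f}")
```

With this symmetric GPA function the schedule drifts toward an even 10/10/10 split, the point where an extra hour is worth the same in every subject.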

Example 3: Tuning Your Recommendation System

The Problem

You are building a Netflix-style recommender. You have 3 knobs to tune:
  • $\alpha$ = Recency weight (how much recent views matter)
  • $\beta$ = Popularity weight (how much overall hits matter)
  • $\gamma$ = Personalization weight (how much user history matters)
Your Error Function (lower is better): $E(\alpha, \beta, \gamma) = (\alpha - 0.6)^2 + (\beta - 0.3)^2 + (\gamma - 0.8)^2 + 0.1\alpha\beta$
Goal: Find the knob settings that minimize error.

Gradient Descent Optimization

Since we want to MINIMIZE error, we move opposite to the gradient.
def error_gradient(alpha, beta, gamma):
    dE_dalpha = 2*(alpha - 0.6) + 0.1*beta
    dE_dbeta = 2*(beta - 0.3) + 0.1*alpha
    dE_dgamma = 2*(gamma - 0.8)
    return np.array([dE_dalpha, dE_dbeta, dE_dgamma])

# Start with random settings
params = np.array([0.2, 0.5, 0.4])
learning_rate = 0.1

print("Optimizing your system...")
for step in range(15):
    grad = error_gradient(*params)
    params = params - learning_rate * grad  # Move OPPOSITE to gradient
    
    if step % 5 == 0:
        print(f"Step {step}: α={params[0]:.2f}, β={params[1]:.2f}, γ={params[2]:.2f}")

print(f"Optimal Settings: α={params[0]:.2f}, β={params[1]:.2f}, γ={params[2]:.2f}")
Real Application: Real recommendation systems optimize thousands of such parameters automatically using this exact method!
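
A fixed number of steps is a guess. A common alternative is to keep stepping until the gradient is nearly zero. The sketch below does that for the same error function and, because this particular gradient is linear in the parameters, compares the result with the exact minimizer from `np.linalg.solve`. The tolerance `1e-8` and the iteration cap are assumptions for illustration.

```python
import numpy as np

def error_gradient(params):
    alpha, beta, gamma = params
    return np.array([2*(alpha - 0.6) + 0.1*beta,
                     2*(beta - 0.3) + 0.1*alpha,
                     2*(gamma - 0.8)])

params = np.array([0.2, 0.5, 0.4])
learning_rate = 0.1
tolerance = 1e-8   # assumed stopping threshold on the gradient norm

steps = 0
while np.linalg.norm(error_gradient(params)) > tolerance and steps < 10_000:
    params = params - learning_rate * error_gradient(params)
    steps += 1

# Exact minimizer: setting the gradient to zero gives a linear system
#   2α + 0.1β = 1.2,   0.1α + 2β = 0.6,   2γ = 1.6
A = np.array([[2.0, 0.1, 0.0],
              [0.1, 2.0, 0.0],
              [0.0, 0.0, 2.0]])
b = np.array([1.2, 0.6, 1.6])
exact = np.linalg.solve(A, b)

print(f"Converged after {steps} steps: {np.round(params, 4)}")
print(f"Exact minimizer:            {np.round(exact, 4)}")
```

Note that the minimizer is not exactly (0.6, 0.3, 0.8): the interaction term $0.1\alpha\beta$ pulls $\alpha$ and $\beta$ slightly lower.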

Directional Derivatives: Choosing Your Path

The Question

The gradient tells you the steepest way up. But what if you can’t go that way? What if you want to go Northeast? Directional Derivative answers: “How fast will I climb if I walk in THIS specific direction?”
[Figure: Directional Derivative Compass]

The Formula

To find the rate of change in the direction of a unit vector $\mathbf{v}$ (length 1): $$\text{Rate} = \nabla f \cdot \mathbf{v} \quad (\text{Dot Product})$$
  • If direction is same as gradient → Max rate (Steepest ascent)
  • If direction is perpendicular → Zero rate (Walking flat)
  • If direction is opposite → Negative rate (Steepest descent)
# Gradient at your position
grad = np.array([39, 40])

# You want to move Northeast (45 degrees)
direction = np.array([1, 1]) 
direction = direction / np.linalg.norm(direction) # Normalize length to 1

# How fast will you climb?
rate = np.dot(grad, direction)
print(f"Climbing rate in Northeast direction: {rate:.2f}")
Key Insight: The dot product measures “alignment”. The more your direction aligns with the gradient, the faster you climb!
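
The three cases listed under the formula can be checked directly. This sketch reuses the gradient `[39, 40]` from the snippet above and evaluates the climbing rate along the gradient, perpendicular to it, opposite to it, and toward the Northeast. The helper `directional_derivative` is a name introduced here for illustration.

```python
import numpy as np

grad = np.array([39.0, 40.0])

def directional_derivative(grad, direction):
    """Rate of change along `direction` (normalized to unit length first)."""
    unit = direction / np.linalg.norm(direction)
    return np.dot(grad, unit)

along = directional_derivative(grad, grad)                               # same direction as gradient
perpendicular = directional_derivative(grad, np.array([-grad[1], grad[0]]))  # rotated 90 degrees
opposite = directional_derivative(grad, -grad)                           # straight downhill
northeast = directional_derivative(grad, np.array([1.0, 1.0]))

print(f"Along gradient:  {along:.3f}  (equals |grad| = {np.linalg.norm(grad):.3f})")
print(f"Perpendicular:   {perpendicular:.3f}")
print(f"Opposite:        {opposite:.3f}")
print(f"Northeast:       {northeast:.3f}")
```

No direction beats walking along the gradient itself, and the best achievable rate equals the gradient's length.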
Numerical Gotcha: Gradient Magnitude in High Dimensions
When you move from 2D toy examples to real ML models with millions of dimensions, gradient magnitudes can behave in surprising ways. In high dimensions, the gradient vector tends to have a large norm simply because it has so many components. If each component is around 0.01, a gradient vector with 1 million components has a norm of about $0.01 \times \sqrt{1{,}000{,}000} = 10$.
This is why gradient clipping is essential in practice. Without it, a sudden spike in a few gradient components can produce an enormous update that destabilizes training:
def clip_gradient(grad, max_norm=1.0):
    """Clip gradient to prevent explosions."""
    grad_norm = np.linalg.norm(grad)
    if grad_norm > max_norm:
        grad = grad * (max_norm / grad_norm)
    return grad
PyTorch’s torch.nn.utils.clip_grad_norm_ applies the same idea to the combined gradient of all of a model’s parameters. Transformer training very commonly uses gradient clipping with max_norm=1.0. Without it, you will eventually hit a batch that produces a massive gradient and blows up your weights.
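
As a quick illustration of both points, the sketch below builds the hypothetical gradient described above (one million components of 0.01 each), confirms its norm is about 10, and shows that clipping shrinks the length while leaving the direction untouched. The array `big` and the variable names are assumptions for this example.

```python
import numpy as np

def clip_gradient(grad, max_norm=1.0):
    """Rescale grad so its norm is at most max_norm (direction unchanged)."""
    grad_norm = np.linalg.norm(grad)
    if grad_norm > max_norm:
        grad = grad * (max_norm / grad_norm)
    return grad

big = np.full(1_000_000, 0.01)   # a million small components
clipped = clip_gradient(big, max_norm=1.0)

small = np.array([0.3, 0.4])     # norm 0.5, already under the limit
untouched = clip_gradient(small, max_norm=1.0)

print(f"Norm before clipping: {np.linalg.norm(big):.4f}")
print(f"Norm after clipping:  {np.linalg.norm(clipped):.4f}")
print(f"Small gradient unchanged: {np.array_equal(small, untouched)}")
```

Clipping only changes how far you step, never which way, so the update still points along the direction the gradient suggested.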

Hessian Matrix (Second Derivatives)

What Is It?

If the gradient is a compass that tells you “which way is steepest,” the Hessian is a topographic survey that tells you “what does the terrain look like in every direction from here?” It is the matrix of all second partial derivatives:
$$H = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix}$$
Think of it like this: the gradient tells you the slope of the ground, and the Hessian tells you how that slope is changing as you walk. Is the hill getting steeper (you are approaching a valley wall) or flatter (you are approaching the bottom)? The Hessian encodes all of that information.

Why It Matters

Hessian tells you about curvature:
  • Positive definite (all eigenvalues positive) — Local minimum (a bowl)
  • Negative definite (all eigenvalues negative) — Local maximum (a hilltop)
  • Indefinite (mixed positive and negative eigenvalues) — Saddle point (a mountain pass)
ML Connection: Saddle Points in High Dimensions
Here is a fact that surprised the deep learning community: in high-dimensional loss landscapes, local minima are rare but saddle points are everywhere. A local extremum requires all eigenvalues of the Hessian to share the same sign: all negative for a local max, all positive for a local min. In a space with millions of dimensions, the probability that all eigenvalues have the same sign is astronomically small. Most critical points (where gradient = 0) are therefore saddle points, where some directions curve up while others curve down.
This is actually good news for optimization. It means gradient descent rarely gets permanently stuck in a bad local minimum. The real challenge is navigating efficiently around saddle points, where gradients are tiny and progress stalls. Momentum-based optimizers like Adam help here because their accumulated velocity carries them through the flat region.
# Example: f(x,y) = x² + y²
def hessian_simple():
    return np.array([
        [2, 0],
        [0, 2]
    ])

H = hessian_simple()
eigenvalues = np.linalg.eigvals(H)
print(f"Eigenvalues: {eigenvalues}")
# Both positive → minimum!
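
The three curvature cases can be turned into a small helper. The sketch below classifies a critical point from the eigenvalues of its Hessian. The function name `classify_critical_point`, the tolerance, and the three sample matrices are assumptions introduced for illustration.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-12):
    """Classify a critical point from the signs of the Hessian's eigenvalues."""
    eigenvalues = np.linalg.eigvalsh(hessian)  # Hessians are symmetric
    if np.all(eigenvalues > tol):
        return "local minimum"
    if np.all(eigenvalues < -tol):
        return "local maximum"
    if np.any(eigenvalues > tol) and np.any(eigenvalues < -tol):
        return "saddle point"
    return "inconclusive"  # some eigenvalue is (numerically) zero

bowl = np.array([[2.0, 0.0], [0.0, 2.0]])        # f = x² + y²
hilltop = np.array([[-2.0, 0.0], [0.0, -2.0]])   # f = -x² - y²
saddle = np.array([[2.0, 0.0], [0.0, -2.0]])     # f = x² - y²

for name, H in [("bowl", bowl), ("hilltop", hilltop), ("saddle", saddle)]:
    print(f"{name}: {classify_critical_point(H)}")
```

The "inconclusive" branch matters in practice: when an eigenvalue is zero, second derivatives alone cannot tell you what kind of point you are at.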

Practice Exercises

Exercise 1: Profit Optimization

# Profit function with 2 variables
def profit(x, y):
    return 50*x + 40*y - x**2 - y**2 - 0.5*x*y

# TODO:
# 1. Compute the gradient
# 2. Find the point where gradient = 0
# 3. Verify it's a maximum using the Hessian

🎯 Practice Exercises & Real-World Applications

Challenge yourself! These exercises show how gradients guide decisions in business, ML, and everyday life.

Exercise 1: Marketing Budget Allocation 📊

A company has a marketing budget to split between Google Ads and Instagram:
import numpy as np

# Conversion rate depends on both channels (they interact!)
# Conversions(g, i) = 100*sqrt(g) + 80*sqrt(i) + 10*sqrt(g*i)
# where g = Google spend ($000s), i = Instagram spend ($000s)
#
# Total budget: $50,000 (g + i = 50)
# Revenue per conversion: $50

# TODO:
# 1. Write the profit function (revenue - costs)
# 2. Compute the gradient ∇Profit
# 3. Find the optimal allocation
# 4. What's the gradient at g=25, i=25? What does it tell you?
import numpy as np

def conversions(g, i):
    """Total conversions from both channels"""
    return 100*np.sqrt(g) + 80*np.sqrt(i) + 10*np.sqrt(g*i)

def profit(g, i, revenue_per_conv=50):
    """Profit = Revenue - Costs"""
    return revenue_per_conv * conversions(g, i) - (g + i) * 1000

def gradient(g, i, revenue_per_conv=50):
    """
    ∂P/∂g = 50 * (50/√g + 5*√(i/g)) - 1000
    ∂P/∂i = 50 * (40/√i + 5*√(g/i)) - 1000
    """
    dP_dg = revenue_per_conv * (50/np.sqrt(g) + 5*np.sqrt(i/g)) - 1000
    dP_di = revenue_per_conv * (40/np.sqrt(i) + 5*np.sqrt(g/i)) - 1000
    return np.array([dP_dg, dP_di])

print("📊 Marketing Budget Optimization")
print("=" * 55)

# Current equal split
g_curr, i_curr = 25, 25
grad = gradient(g_curr, i_curr)
print(f"\n📍 Current Split: Google=${g_curr}k, Instagram=${i_curr}k")
print(f"   Conversions: {conversions(g_curr, i_curr):.0f}")
print(f"   Profit: ${profit(g_curr, i_curr):,.0f}")
print(f"   Gradient: [{grad[0]:.2f}, {grad[1]:.2f}]")
print(f"\n   💡 Interpretation:")
print(f"   • Google marginal value: ${grad[0]:.0f} per $1k extra")
print(f"   • Instagram marginal value: ${grad[1]:.0f} per $1k extra")
if grad[0] > grad[1]:
    print(f"   → Shift budget TO Google!")
else:
    print(f"   → Shift budget TO Instagram!")

# Gradient ascent along the budget constraint g + i = 50
def optimize_constrained():
    g = 25.0
    lr = 0.5
    for _ in range(2000):  # steps are tiny, so allow enough iterations to converge
        grad = gradient(g, 50-g)
        # Shift budget toward the channel with the higher marginal value
        g = g + lr * (grad[0] - grad[1]) / 2000
        g = np.clip(g, 1, 49)  # Keep valid
    return g, 50-g

g_opt, i_opt = optimize_constrained()
print(f"\n🎯 Optimal Allocation:")
print(f"   Google: ${g_opt:.1f}k ({g_opt/50*100:.0f}%)")
print(f"   Instagram: ${i_opt:.1f}k ({i_opt/50*100:.0f}%)")
print(f"   Profit: ${profit(g_opt, i_opt):,.0f}")
print(f"\n📈 Improvement: +${profit(g_opt, i_opt) - profit(25, 25):,.0f} vs equal split")
Real-World Insight: Performance marketing teams reason about ad spend in much the same way. The gradient tells you where your next dollar is most valuable!

Exercise 2: Neural Network Weight Update 🧠

Manually compute a gradient update for a tiny neural network:
import numpy as np

# Simple network: 2 inputs → 2 weights → 1 output
# y = w1*x1 + w2*x2
# Loss = (y - target)²

# Data point: x = [3, 4], target = 10
# Current weights: w = [1, 1]
# Predicted: y = 1*3 + 1*4 = 7
# Loss = (7 - 10)² = 9

# TODO:
# 1. Compute ∂Loss/∂w1 and ∂Loss/∂w2
# 2. Update weights with learning rate 0.01 (0.1 is too large here and makes the loss grow)
# 3. Compute new prediction and loss
# 4. Repeat for 5 steps and watch loss decrease
import numpy as np

def predict(w, x):
    return w[0]*x[0] + w[1]*x[1]

def loss(y_pred, y_true):
    return (y_pred - y_true) ** 2

def gradient(w, x, y_true):
    """
    L = (w1*x1 + w2*x2 - target)²
    ∂L/∂w1 = 2(y_pred - target) * x1
    ∂L/∂w2 = 2(y_pred - target) * x2
    """
    y_pred = predict(w, x)
    error = y_pred - y_true
    return np.array([2 * error * x[0], 2 * error * x[1]])

print("🧠 Neural Network Gradient Descent")
print("=" * 55)

# Setup
x = np.array([3, 4])
target = 10
w = np.array([1.0, 1.0])
lr = 0.01  # 0.1 overshoots: each step would scale the error by 1 - 2*0.1*(3² + 4²) = -4

print(f"Data: x = {x}, target = {target}")
print(f"Initial weights: w = {w}")
print(f"Learning rate: {lr}")
print("\n" + "-" * 55)
print(f"{'Step':<6} {'Weights':<20} {'Pred':<8} {'Loss':<10} {'Gradient'}")
print("-" * 55)

for step in range(6):
    y_pred = predict(w, x)
    L = loss(y_pred, target)
    grad = gradient(w, x, target)
    
    print(f"{step:<6} [{w[0]:.3f}, {w[1]:.3f}]     {y_pred:<8.2f} {L:<10.4f} [{grad[0]:.2f}, {grad[1]:.2f}]")
    
    # Update weights
    w = w - lr * grad

print("-" * 55)
print(f"\n✅ Final weights: w = [{w[0]:.3f}, {w[1]:.3f}]")
print(f"   Prediction: {predict(w, x):.4f} (target was {target})")
print(f"   Loss reduced from 9.0 to {loss(predict(w, x), target):.6f}")

# Verify: [1, 1.75] is one of many weight vectors that fit this point exactly (1*3 + 1.75*4 = 10)
print(f"\n💡 One exact solution: [1.0, 1.75] → {1*3 + 1.75*4}")
Real-World Insight: This is the fundamental update rule in ALL neural network training! PyTorch, TensorFlow, and JAX all do exactly this - just with millions of weights and clever optimizations.

Exercise 3: Heat Map Navigation 🗺️

You’re a robot navigating a temperature field. Find the hottest spot:
import numpy as np

# Temperature field (2D Gaussian peaks)
# T(x, y) = 80*exp(-((x-3)² + (y-2)²)/10) + 60*exp(-((x+2)² + (y+1)²)/5)
# Two heat sources: one at (3, 2), another at (-2, -1)

# You start at position (0, 0)
# Use gradient ascent to find the hottest spot

# TODO:
# 1. Compute the gradient of T
# 2. Implement gradient ascent
# 3. Which heat source do you reach?
# 4. Try different starting positions - do you reach different peaks?
import numpy as np

def temperature(x, y):
    """Two Gaussian heat sources"""
    peak1 = 80 * np.exp(-((x-3)**2 + (y-2)**2) / 10)  # Peak at (3, 2), max=80
    peak2 = 60 * np.exp(-((x+2)**2 + (y+1)**2) / 5)   # Peak at (-2, -1), max=60
    return peak1 + peak2

def gradient_T(x, y):
    """Gradient of temperature field"""
    # For peak1: 80*exp(-((x-3)² + (y-2)²)/10)
    # ∂/∂x = 80 * exp(...) * (-2(x-3)/10) = peak1 * (-(x-3)/5)
    peak1 = 80 * np.exp(-((x-3)**2 + (y-2)**2) / 10)
    peak2 = 60 * np.exp(-((x+2)**2 + (y+1)**2) / 5)
    
    dT_dx = peak1 * (-(x-3) / 5) + peak2 * (-(x+2) / 2.5)
    dT_dy = peak1 * (-(y-2) / 5) + peak2 * (-(y+1) / 2.5)
    
    return np.array([dT_dx, dT_dy])

def gradient_ascent(start_x, start_y, lr=0.05, steps=50):
    """Climb the temperature gradient (small lr: the peaks are sharp, so big steps overshoot them)"""
    x, y = start_x, start_y
    path = [(x, y, temperature(x, y))]
    
    for _ in range(steps):
        grad = gradient_T(x, y)
        x = x + lr * grad[0]
        y = y + lr * grad[1]
        path.append((x, y, temperature(x, y)))
        
        if np.linalg.norm(grad) < 0.01:
            break
    
    return x, y, path

print("🗺️ Heat Map Navigation (Gradient Ascent)")
print("=" * 55)
print("Heat sources: Peak1 at (3, 2) = 80°C, Peak2 at (-2, -1) = 60°C")

# Test different starting positions
starts = [(0, 0), (5, 0), (-3, 0), (0, 3), (0, -3)]

print("\n📍 Starting Position → Final Position → Peak Reached")
print("-" * 55)

for sx, sy in starts:
    fx, fy, path = gradient_ascent(sx, sy)
    final_temp = temperature(fx, fy)
    
    # Determine which peak
    if fx > 0:
        peak = "Peak1 (80°C)"
    else:
        peak = "Peak2 (60°C)"
    
    print(f"   ({sx:3}, {sy:3}) → ({fx:.1f}, {fy:.1f}) → {peak} at {final_temp:.1f}°C")

# Detailed path from origin
print("\n🚶 Detailed Path from (0, 0):")
_, _, path = gradient_ascent(0, 0)
print("   Step | Position    | Temperature | Gradient")
print("   -----|-------------|-------------|----------")
for i in [0, 5, 10, 20, len(path)-1]:
    if i < len(path):
        x, y, t = path[i]
        g = gradient_T(x, y)
        print(f"   {i:4} | ({x:4.1f}, {y:4.1f}) | {t:11.2f} | ({g[0]:5.2f}, {g[1]:5.2f})")

print("\n💡 Key Insight:")
print("   Gradient ascent finds LOCAL maxima - you reach")
print("   whichever peak you're closest to initially!")
print("   This is why neural networks can get stuck in local minima!")
Real-World Insight: This local vs global optimum problem is fundamental in ML. It’s why we use random initialization, momentum, and techniques like simulated annealing to escape local optima!

Exercise 4: Portfolio Optimization 💼

Find the optimal stock allocation to maximize risk-adjusted return:
import numpy as np

# Two stocks: A (high risk/return) and B (low risk/return)
# Expected return: R(a, b) = 0.15*a + 0.08*b (a, b are allocation fractions)
# Variance (risk): V(a, b) = 0.04*a² + 0.01*b² + 0.01*a*b
# 
# Sharpe ratio (risk-adjusted return): S = R / sqrt(V)
# Constraint: a + b = 1 (fully invested)

# TODO:
# 1. Express S in terms of a only (since b = 1 - a)
# 2. Find the gradient ∂S/∂a
# 3. Find optimal allocation
# 4. Compare with 50/50 split
import numpy as np

def returns(a):
    """Expected return: R = 0.15*a + 0.08*(1-a)"""
    b = 1 - a
    return 0.15 * a + 0.08 * b

def variance(a):
    """Portfolio variance"""
    b = 1 - a
    return 0.04 * a**2 + 0.01 * b**2 + 0.01 * a * b

def sharpe(a):
    """Sharpe ratio = Return / Risk"""
    return returns(a) / np.sqrt(variance(a))

def sharpe_gradient(a, eps=1e-6):
    """Numerical gradient for Sharpe ratio"""
    return (sharpe(a + eps) - sharpe(a - eps)) / (2 * eps)

print("💼 Portfolio Optimization")
print("=" * 55)
print("Stock A: 15% return, 20% volatility (high risk)")
print("Stock B: 8% return, 10% volatility (low risk)")
print("Correlation: 0.25")  # cross term 0.01*a*b = 2 * 0.25 * 0.20 * 0.10 * a*b

# Gradient ascent to find optimal allocation
a = 0.5  # Start at 50/50
lr = 0.5
history = [(a, sharpe(a))]

for _ in range(50):
    grad = sharpe_gradient(a)
    a_new = a + lr * grad
    a = np.clip(a_new, 0, 1)  # Keep valid allocation
    history.append((a, sharpe(a)))
    if abs(grad) < 1e-6:
        break

optimal_a = a

print(f"\n🎯 Optimal Allocation:")
print(f"   Stock A (high risk): {optimal_a*100:.1f}%")
print(f"   Stock B (low risk): {(1-optimal_a)*100:.1f}%")
print(f"   Expected Return: {returns(optimal_a)*100:.2f}%")
print(f"   Portfolio Risk: {np.sqrt(variance(optimal_a))*100:.2f}%")
print(f"   Sharpe Ratio: {sharpe(optimal_a):.4f}")

# Comparison table
print("\n📊 Allocation Comparison:")
print("   Allocation | Return | Risk   | Sharpe")
print("   ----------|--------|--------|--------")
for alloc, label in [(0, "100% B"), (0.5, "50/50"), (optimal_a, "Optimal"), (1, "100% A")]:
    r = returns(alloc)
    v = np.sqrt(variance(alloc))
    s = sharpe(alloc)
    marker = " ←" if abs(alloc - optimal_a) < 0.01 else ""
    print(f"   {label:9} | {r*100:5.1f}% | {v*100:5.1f}% | {s:.4f}{marker}")

print(f"\n💡 Key Insight:")
print(f"   The gradient told us to shift from 50/50 toward a lower Stock A")
print(f"   allocation, but not 100% Stock B - diversification reduces risk!")
Real-World Insight: This is Modern Portfolio Theory (Markowitz, Nobel Prize 1990). Robo-advisors such as Wealthfront and Betterment build on these ideas to find efficient portfolios!

🎯 Practice Problems: Test Your Understanding

Before moving on, make sure you can solve these problems. They’re ordered by difficulty.
Given: $f(x, y) = 3x^2 + 4xy - y^2 + 5$
Find:
  1. $\frac{\partial f}{\partial x}$
  2. $\frac{\partial f}{\partial y}$
  3. $\nabla f$ at point $(1, 2)$
Step 1: Find $\frac{\partial f}{\partial x}$ (treat $y$ as constant): $\frac{\partial f}{\partial x} = 6x + 4y + 0 + 0 = 6x + 4y$
Step 2: Find $\frac{\partial f}{\partial y}$ (treat $x$ as constant): $\frac{\partial f}{\partial y} = 0 + 4x - 2y + 0 = 4x - 2y$
Step 3: Evaluate at $(1, 2)$: $$\nabla f(1, 2) = \begin{bmatrix}6(1) + 4(2) \\ 4(1) - 2(2)\end{bmatrix} = \begin{bmatrix}14 \\ 0\end{bmatrix}$$
Interpretation: At point (1, 2), the function increases fastest in the x-direction. The zero in y means changing y alone (at this point) doesn’t change f at first order.
Given: $g(x, y) = x^2 e^y$
Find: $\nabla g$
Find $\frac{\partial g}{\partial x}$: Treat $e^y$ as a constant: $\frac{\partial g}{\partial x} = 2x \cdot e^y = 2xe^y$
Find $\frac{\partial g}{\partial y}$: Treat $x^2$ as a constant: $\frac{\partial g}{\partial y} = x^2 \cdot e^y = x^2e^y$
The Gradient: $$\nabla g = \begin{bmatrix}2xe^y \\ x^2e^y\end{bmatrix} = e^y\begin{bmatrix}2x \\ x^2\end{bmatrix}$$
Given: $h(x, y) = -x^2 - 2y^2 + 4x + 8y - 10$
Find: The point $(x^*, y^*)$ where $\nabla h = 0$ (the critical point).
Step 1: Compute the gradient: $\frac{\partial h}{\partial x} = -2x + 4$ and $\frac{\partial h}{\partial y} = -4y + 8$
Step 2: Set each component to zero: $-2x + 4 = 0 \implies x = 2$ and $-4y + 8 = 0 \implies y = 2$
The critical point is $(2, 2)$.
Step 3: Verify it’s a maximum (Hessian check): $$H = \begin{bmatrix}-2 & 0 \\ 0 & -4\end{bmatrix}$$
Both eigenvalues are negative, so this is indeed a maximum!
Value at maximum: $h(2, 2) = -4 - 8 + 8 + 16 - 10 = 2$
Given: The MSE loss for linear regression with 3 data points:
  • (x1,y1)=(1,3)(x_1, y_1) = (1, 3)
  • (x2,y2)=(2,5)(x_2, y_2) = (2, 5)
  • (x3,y3)=(3,7)(x_3, y_3) = (3, 7)
Model: y^=wx+b\hat{y} = wx + bLoss: L(w,b)=13i=13(y^iyi)2L(w, b) = \frac{1}{3}\sum_{i=1}^{3}(\hat{y}_i - y_i)^2Find: Lw\frac{\partial L}{\partial w} and Lb\frac{\partial L}{\partial b} at w=1,b=1w=1, b=1.
Step 1: Compute predictions at $w=1, b=1$:
  • $\hat{y}_1 = 1(1) + 1 = 2$ (actual: 3, error: -1)
  • $\hat{y}_2 = 1(2) + 1 = 3$ (actual: 5, error: -2)
  • $\hat{y}_3 = 1(3) + 1 = 4$ (actual: 7, error: -3)

Step 2: Loss formula expanded:

$$L = \frac{1}{3}\left[(w \cdot 1 + b - 3)^2 + (w \cdot 2 + b - 5)^2 + (w \cdot 3 + b - 7)^2\right]$$

Step 3: Gradient formulas:

$$\frac{\partial L}{\partial w} = \frac{2}{3}\sum_{i=1}^{3}(wx_i + b - y_i) \cdot x_i$$

$$\frac{\partial L}{\partial b} = \frac{2}{3}\sum_{i=1}^{3}(wx_i + b - y_i)$$

Step 4: Evaluate:

$$\frac{\partial L}{\partial w} = \frac{2}{3}[(-1)(1) + (-2)(2) + (-3)(3)] = \frac{2}{3}(-1 - 4 - 9) = \frac{2}{3}(-14) = -\frac{28}{3} \approx -9.33$$

$$\frac{\partial L}{\partial b} = \frac{2}{3}[(-1) + (-2) + (-3)] = \frac{2}{3}(-6) = -4$$

Interpretation: Both gradients are negative, meaning a small increase in $w$ or $b$ will DECREASE the loss (which is what we want!). The true optimal values are $w=2, b=1$. Note that $b$ is already at its optimal value, yet $\frac{\partial L}{\partial b} \neq 0$: the gradient describes the local slope given the current (incorrect) $w$, not the distance to the optimum in each coordinate.
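The Step 3 formulas translate almost directly into code. The sketch below evaluates them at $w=1, b=1$ and then runs plain gradient descent from that point; the learning rate `0.05` and the iteration count are assumptions chosen so that this small problem converges.

```python
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

def mse_gradient(w, b):
    """Return (dL/dw, dL/db) for L = mean((w*x + b - y)^2)."""
    n = len(xs)
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    dL_dw = 2 / n * sum(e * x for e, x in zip(errors, xs))
    dL_db = 2 / n * sum(errors)
    return dL_dw, dL_db

# Gradient at the point used in the exercise.
dL_dw, dL_db = mse_gradient(1.0, 1.0)
print(dL_dw, dL_db)

# Gradient descent: step AGAINST the gradient to reduce the loss.
w, b = 1.0, 1.0
for _ in range(5000):
    gw, gb = mse_gradient(w, b)
    w, b = w - 0.05 * gw, b - 0.05 * gb
print(w, b)
```

The first print should reproduce $-28/3$ and $-4$; the second should approach $w=2, b=1$.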

🔑 Key Takeaways

Gradient Essentials:
  • Gradient - Vector of all partial derivatives; ∇f = [∂f/∂x₁, ∂f/∂x₂, …]
  • Direction - Points toward steepest ascent; negate for descent
  • Magnitude - Tells you how steep the slope is at that point
  • Optimization - Critical points occur where ∇f = 0; minima and maxima are found among them
  • Hessian - Second derivatives tell if it’s a min (positive definite), max (negative definite), or saddle (indefinite)

Interview Prep: Gradient Questions

Q: What does the gradient represent geometrically?
The gradient points in the direction of steepest increase. Its magnitude indicates how steep that ascent is. For a loss function, we move in the opposite direction (−∇f) to find the minimum.
Q: Why can’t we just set the gradient to zero and solve for the weights of a neural network?
Neural networks have millions of parameters with highly non-linear, non-convex loss surfaces. There’s no closed-form solution. We must use iterative gradient descent to find good (local) minima.
Q: What’s the Hessian and when is it useful?
The Hessian is the matrix of second partial derivatives. It tells us about curvature: positive definite = minimum, negative definite = maximum, indefinite = saddle point. Second-order methods use it for faster convergence but are expensive.
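For a function of two variables, the definiteness test can be sketched with the closed-form eigenvalues of a symmetric 2x2 matrix. The function name, argument names, and tolerance below are illustrative assumptions.

```python
import math

def classify_critical_point(fxx, fxy, fyy, tol=1e-12):
    """Classify a critical point from the symmetric Hessian [[fxx, fxy], [fxy, fyy]]."""
    mean = (fxx + fyy) / 2
    radius = math.sqrt(((fxx - fyy) / 2) ** 2 + fxy ** 2)
    lo, hi = mean - radius, mean + radius  # the two eigenvalues
    if lo > tol:
        return "minimum"       # positive definite
    if hi < -tol:
        return "maximum"       # negative definite
    if lo < -tol and hi > tol:
        return "saddle"        # indefinite
    return "inconclusive"      # a zero eigenvalue: second-order test fails

print(classify_critical_point(-2.0, 0.0, -4.0))  # Hessian of h from the exercise above
print(classify_critical_point(2.0, 0.0, -2.0))   # Hessian of x^2 - y^2
```

For models with many parameters the full eigen-decomposition is too expensive, which is why this kind of check is mostly used on small problems or as a diagnostic.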

Common Pitfalls

Gradient Mistakes to Avoid:
  1. Sign Confusion - Gradient points uphill; for minimization, move in the OPPOSITE direction
  2. Ignoring Multiple Variables - Partial derivatives hold others constant; the gradient combines them all
  3. Assuming Global Optimum - Non-convex functions can have many local minima and saddle points; a zero gradient doesn’t mean global best
  4. Dimension Mismatch - Gradient has same dimension as input; for f : ℝⁿ → ℝ, ∇f : ℝⁿ → ℝⁿ

What’s Next?

You now understand gradients for multi-variable functions. But how do we handle COMPOSITIONS of functions (like neural networks with many layers)? That’s the chain rule - and it’s the key to backpropagation!

Next: Chain Rule & Backpropagation

Discover how neural networks learn through backpropagation

Interview Deep-Dive

Strong Answer:
  • The feasibility comes from reverse-mode automatic differentiation (backpropagation). The key result is that computing the gradient of a scalar loss with respect to ALL parameters costs roughly 2-3x the cost of a single forward pass, regardless of the number of parameters. This is because the backward pass reuses the computational graph structure and intermediate values from the forward pass.
  • If we used numerical differentiation via central differences — (f(w+h) - f(w-h))/(2h) for each parameter — we would need 200 million forward passes (two per parameter). If one forward pass takes 100ms, the numerical gradient would take about 231 days. Backpropagation computes the same gradient in about 200-300ms. That is a speedup factor of roughly 100 million.
  • The mathematical reason this works is the chain rule applied in reverse order. Instead of computing each parameter’s gradient independently, backpropagation shares intermediate computations. The gradient flowing into a layer is computed once and then used to derive gradients for all parameters in that layer simultaneously.
  • In practice, the computational bottleneck is memory, not FLOPS. You need to store all intermediate activations from the forward pass to use during the backward pass. For large models, this is why techniques like gradient checkpointing exist — they trade compute for memory by recomputing some activations during the backward pass instead of storing them.
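The arithmetic behind those figures can be reproduced in a few lines. The parameter count of 100 million is an assumption inferred from "200 million forward passes (two per parameter)", and the 2.5x backprop factor is simply the midpoint of the quoted 2-3x range.

```python
n_params = 100_000_000   # assumed from "two forward passes per parameter = 200 million"
forward_s = 0.100        # one forward pass, in seconds

# Central differences: two forward passes per parameter.
numerical_s = 2 * n_params * forward_s
numerical_days = numerical_s / 86_400

# Backpropagation: roughly 2-3x one forward pass (midpoint used here).
backprop_s = 2.5 * forward_s
speedup = numerical_s / backprop_s

print(f"numerical: {numerical_days:.0f} days, backprop: {backprop_s:.2f} s, speedup: {speedup:.1e}")
```

The speedup comes out at just under $10^8$, consistent with the "roughly 100 million" estimate.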
Follow-up: You mentioned gradient checkpointing trades compute for memory. Quantify that trade-off. When is it worth using?

With standard backpropagation, memory scales linearly with the number of layers (store one activation tensor per layer). With gradient checkpointing, you only store activations at certain “checkpoint” layers, say every sqrt(L) layers for L total layers. When the backward pass reaches a segment between checkpoints, it recomputes the forward pass for that segment. The memory drops from O(L) to O(sqrt(L)), but compute increases by roughly 33% (one extra forward pass on top of a forward-plus-backward cost of about three forward passes). For a 100-layer transformer where each layer’s activation is 2GB, standard backprop needs 200GB just for activations, while checkpointing at every 10th layer needs about 20GB for the stored checkpoints, plus the activations of the single segment being recomputed at any moment (up to about another 20GB) and the extra compute. I have seen this be the difference between fitting a model on 8 GPUs versus 32 GPUs, which directly impacts training cost. It is worth using whenever you are memory-bound, which for large language models is almost always.
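The memory estimate can be sketched as a small calculation. This is a deliberately simplified model (equal-sized activations, evenly spaced checkpoints, nothing else counted), and the function name and return format are illustrative assumptions.

```python
import math

def activation_memory_gb(num_layers, gb_per_layer, checkpoint_every=None):
    """Simplified activation-memory model.

    Without checkpointing: returns the total for all layers.
    With checkpointing: returns (stored checkpoints, transient memory for
    the one segment being recomputed during the backward pass).
    """
    if checkpoint_every is None:
        return num_layers * gb_per_layer
    stored = math.ceil(num_layers / checkpoint_every) * gb_per_layer
    segment = checkpoint_every * gb_per_layer
    return stored, segment

standard_gb = activation_memory_gb(100, 2.0)
checkpoints_gb, segment_gb = activation_memory_gb(100, 2.0, checkpoint_every=10)
print(standard_gb, checkpoints_gb, segment_gb)
```

Choosing `checkpoint_every` near `sqrt(num_layers)` balances the two terms, which is where the O(sqrt(L)) figure comes from.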
Strong Answer:
  • The gradient at a point gives you the direction in parameter space where the function increases most rapidly. Geometrically, if you think of the loss surface as a terrain, the gradient is a vector lying in the “horizontal” parameter plane that points directly uphill along the steepest slope.
  • The mathematical proof is elegant: consider all possible unit-length directions u. The directional derivative is the dot product of the gradient with u: D_u(f) = nabla(f) dot u. By the Cauchy-Schwarz inequality, this is maximized when u points in the same direction as the gradient. So the gradient is literally the answer to “which direction maximizes the rate of increase?”
  • The magnitude of the gradient equals the rate of increase in that steepest direction. A gradient magnitude of 100 means the function changes by 100 per unit step in the gradient direction. A magnitude near zero means the surface is nearly flat — you are near a critical point (minimum, maximum, or saddle).
  • For ML, the practical implication is that gradient magnitude gives you a diagnostic signal. If gradient norms are large, you are on a steep part of the loss surface and large steps could overshoot. If they are near zero, you might be converged, stuck at a saddle, or in a flat region. Monitoring gradient norms per layer during training is one of the most useful diagnostics available.
  • A subtlety that trips people up: the gradient lives in parameter space, not in input space. When we say “the gradient of the loss with respect to the weights,” we are describing a direction in weight-space, not a direction in the data.
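The Cauchy-Schwarz argument can be checked numerically by sweeping unit directions and confirming that the directional derivative peaks along the gradient. The gradient value `(3, 4)` is a hypothetical example chosen because its norm is exactly 5.

```python
import math

grad = (3.0, 4.0)  # hypothetical gradient at some point; its norm is 5

def directional_derivative(grad, angle):
    """D_u f = grad . u for the unit vector u = (cos(angle), sin(angle))."""
    u = (math.cos(angle), math.sin(angle))
    return grad[0] * u[0] + grad[1] * u[1]

# Sweep 3600 unit directions around the circle.
angles = [2 * math.pi * k / 3600 for k in range(3600)]
best_angle = max(angles, key=lambda a: directional_derivative(grad, a))
best_rate = directional_derivative(grad, best_angle)

grad_angle = math.atan2(grad[1], grad[0])
grad_norm = math.hypot(*grad)
print(best_angle, grad_angle)  # nearly equal
print(best_rate, grad_norm)    # best rate is (almost exactly) the gradient norm
```

The best direction found by the sweep coincides with the gradient's direction up to the grid resolution, and the best rate matches the gradient's magnitude.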
Follow-up: If the gradient always points toward steepest ascent, why do some people claim that following the negative gradient is not always the best optimization strategy?

Because steepest descent only considers first-order (linear) information about the loss surface. It ignores curvature. Imagine a narrow ravine: the steepest descent direction points across the ravine (the steep walls) rather than along the ravine floor toward the minimum. You end up zig-zagging back and forth. Second-order methods like Newton’s method use the Hessian matrix (curvature information) to find a better direction that accounts for the shape of the valley. The ideal step would be the Hessian-inverse times the gradient, which “straightens” the zig-zag into a direct path. But computing and inverting the Hessian is O(n^2) in space and O(n^3) in time for n parameters, which is prohibitive for large models. That is why we use approximate second-order methods: Adam approximates per-parameter curvature using running averages of squared gradients, and L-BFGS maintains a low-rank Hessian approximation. These give “better than steepest descent” directions at manageable cost.
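The ravine picture can be reproduced on a toy quadratic. The function $f(x, y) = \tfrac{1}{2}(x^2 + 100y^2)$, the starting point, and the learning rate below are all assumptions chosen to make the zig-zag visible; they are not taken from a real model.

```python
def grad(x, y):
    """Gradient of f(x, y) = 0.5 * (x**2 + 100 * y**2), a narrow ravine along the x-axis."""
    return (x, 100.0 * y)

def gd_path(x, y, lr, steps):
    """Plain gradient descent, recording every iterate."""
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
        path.append((x, y))
    return path

def newton_step(x, y):
    """One Newton step: subtract H^{-1} g, where H = diag(1, 100) for this f."""
    gx, gy = grad(x, y)
    return (x - gx / 1.0, y - gy / 100.0)

path = gd_path(10.0, 1.0, lr=0.019, steps=100)
newton = newton_step(10.0, 1.0)

print(path[1], path[2])  # y flips sign every step: the zig-zag across the ravine
print(path[-1])          # after 100 steps x is still far from 0
print(newton)            # Newton lands on the minimum in one step
```

The learning rate has to stay below 0.02 to avoid divergence in the steep `y` direction, which is exactly what makes progress along the flat `x` direction so slow.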
Strong Answer:
  • The gradient is the first derivative of a scalar-valued function with respect to a vector input. For a loss L with parameters theta in R^n, the gradient is a vector in R^n: nabla(L) = [dL/d(theta_1), …, dL/d(theta_n)]. It tells you the direction and rate of steepest increase. This shows up everywhere in ML — every training step uses it.
  • The Jacobian generalizes the gradient to vector-valued functions. If f maps R^n to R^m, the Jacobian is an m-by-n matrix where entry (i,j) is d(f_i)/d(x_j). In ML, the Jacobian appears in backpropagation through layers: each layer’s output is a vector, and the Jacobian of that layer’s transformation tells you how to propagate gradients backward. The chain rule for neural networks is literally multiplying Jacobian matrices: dL/dx = dL/dy * dy/dx where dy/dx is the Jacobian of the layer.
  • The Hessian is the second derivative of a scalar function — the matrix of all second partial derivatives. Entry (i,j) is d^2L/(d(theta_i) d(theta_j)). It tells you about curvature: how the gradient itself changes as you move in parameter space. Eigenvalues of the Hessian reveal the local geometry. Positive eigenvalues mean you are in a bowl (minimum direction). Negative eigenvalues mean you are on a ridge (maximum direction). A mix means saddle point.
  • Practical usage in ML: Gradients are used every training step. Jacobians are computed implicitly during backpropagation (you never form the full matrix for large networks). Hessians are almost never computed explicitly for large models (too expensive — n^2 entries), but approximations appear in second-order optimizers like K-FAC, natural gradient methods, and in diagnostics like the loss surface sharpness measures used in generalization research.
  • A useful mental model: the gradient is slope, the Hessian is curvature. You need slope to know which way to go. You need curvature to know how far to go and whether your destination is a minimum or a saddle.
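The "chain rule as Jacobian multiplication" point can be made concrete with a tiny linear layer. In this sketch the weights `W`, the loss $L = \tfrac{1}{2}\lVert y \rVert^2$, and the input are hypothetical; the Jacobian of `layer` is just `W`, so the backward pass is a product of the upstream gradient with `W`.

```python
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # hypothetical 3x2 layer weights

def layer(x):
    """y = W x, so the Jacobian dy/dx is the 3x2 matrix W itself."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def loss(x):
    """Scalar loss L = 0.5 * ||y||^2, so dL/dy = y."""
    return 0.5 * sum(y_i ** 2 for y_i in layer(x))

def grad_via_jacobian(x):
    """dL/dx_j = sum_i (dL/dy_i) * (dy_i/dx_j) = sum_i y_i * W[i][j]."""
    y = layer(x)
    return [sum(y[i] * W[i][j] for i in range(len(W))) for j in range(len(x))]

def grad_numeric(x, h=1e-6):
    """Central-difference gradient, one input coordinate at a time."""
    grads = []
    for j in range(len(x)):
        plus, minus = list(x), list(x)
        plus[j] += h
        minus[j] -= h
        grads.append((loss(plus) - loss(minus)) / (2 * h))
    return grads

x0 = [1.0, -1.0]
analytic = grad_via_jacobian(x0)
numeric = grad_numeric(x0)
print(analytic, numeric)
```

Note that `grad_via_jacobian` never needs the Jacobian of the whole network, only a product with the upstream gradient, which is how frameworks avoid forming full Jacobian matrices.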
Follow-up: If we never compute the full Hessian for large models, how does Adam approximate second-order information, and is its approximation accurate?

Adam maintains a running average of squared gradients (the v term), which can be read as a rough estimate of the diagonal of the (empirical) Fisher information matrix, itself a common proxy for curvature. It divides each gradient component by the square root of this running average, effectively giving each parameter its own adaptive learning rate. Parameters with consistently large gradients (high curvature directions) get smaller effective learning rates, and vice versa. The approximation is crude: it only captures diagonal curvature, ignoring all cross-parameter interactions (off-diagonal Hessian entries). For problems where the Hessian is approximately diagonal (parameters are roughly independent), Adam works very well. For problems with strong parameter correlations, Adam’s diagonal approximation misses important structure. This is why methods like K-FAC (which approximates block-diagonal Hessian structure using Kronecker factors) can outperform Adam on some problems, and why SGD with momentum sometimes generalizes better than Adam for CNNs: the implicit regularization of ignoring curvature can actually be beneficial.
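The per-parameter scaling is easiest to see in a single update. The sketch below is a minimal pure-Python version of the Adam update with its commonly used default hyperparameters; the function name, the list-based interface, and the example gradients are assumptions for illustration.

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over lists of scalars; t is the 1-based step count."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # running mean of gradients
        v_i = beta2 * v_i + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m_i / (1 - beta1 ** t)             # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

params = [1.0, 1.0]
grads = [100.0, 0.01]  # gradients that differ by four orders of magnitude
new_params, m, v = adam_step(params, grads, m=[0.0, 0.0], v=[0.0, 0.0], t=1)
step_sizes = [p - q for p, q in zip(params, new_params)]
print(step_sizes)  # both close to lr
```

Both parameters move by about `lr` even though their gradients differ by a factor of 10,000, and each parameter is scaled independently, which is the diagonal-only behavior described above.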
Strong Answer:
  • The most common cause is a feature scaling issue. If one input feature has a range of [0, 1000000] while others are in [0, 1], the partial derivative with respect to the first feature’s associated weight will be proportionally larger because the gradient includes the input activation as a factor. The loss surface becomes a narrow elongated valley — very steep in one direction, very flat in others.
  • The immediate fix is input normalization. Standardize all features to zero mean and unit variance, or scale them to a common range. This makes the loss surface more isotropic (similar curvature in all directions), which dramatically improves gradient descent convergence. Without normalization, you need a very small learning rate to avoid divergence in the steep direction, which makes learning painfully slow in the flat directions.
  • If the data is already normalized and you still see this, it could be an activation-scale issue inside the network. In deep networks, activations can grow or shrink across layers. BatchNorm and LayerNorm exist specifically to keep activations and gradients at consistent scales throughout the network.
  • Another possibility: a learning rate that is too large for some parameters but appropriate for others. This is where per-parameter adaptive optimizers like Adam shine — they automatically scale down the learning rate for parameters with large gradients and scale up for those with small gradients.
  • A less obvious cause: a bug in the loss function where one component dominates. For multi-task losses, if one task’s loss is 1000x larger than another’s, its gradients will dominate the update. The fix is loss balancing — either manual scaling or uncertainty-based weighting as in Kendall et al.’s multi-task learning paper.
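The feature-scaling effect from the first two points can be seen on a tiny linear regression. The dataset below is made up for illustration (feature 0 spans roughly `[0, 1e6]`, feature 1 spans `[0, 1]`), and the helper names are assumptions.

```python
import statistics

# Hypothetical data: feature 0 is on a ~1e6 scale, feature 1 on a ~1 scale.
X = [[200_000.0, 0.2], [400_000.0, 0.9], [600_000.0, 0.1], [1_000_000.0, 0.6]]
y = [1.0, 2.0, 3.0, 5.0]

def mse_grad(X, y, w):
    """Gradient of mean squared error for the linear model y_hat = w . x."""
    n = len(X)
    grad = [0.0] * len(w)
    for row, target in zip(X, y):
        err = sum(wj * xj for wj, xj in zip(w, row)) - target
        for j, xj in enumerate(row):
            grad[j] += 2 / n * err * xj  # the input value is a factor of the gradient
    return grad

def standardize(X):
    """Rescale each column to zero mean and unit variance."""
    cols = list(zip(*X))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in X]

raw = mse_grad(X, y, [0.0, 0.0])
scaled = mse_grad(standardize(X), y, [0.0, 0.0])
raw_ratio = abs(raw[0] / raw[1])
scaled_ratio = abs(scaled[0] / scaled[1])
print(f"raw ratio: {raw_ratio:.3g}, standardized ratio: {scaled_ratio:.3g}")
```

Before standardization the two gradient components differ by about six orders of magnitude; afterwards the remaining difference reflects how informative each feature is rather than its units.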
Follow-up: If you apply BatchNorm to fix gradient scale issues, what new problem does BatchNorm introduce during inference, and how is it handled?

During training, BatchNorm computes statistics (mean and variance) from the current mini-batch. During inference, you may process a single sample or a differently-sized batch, so batch statistics are unreliable. The solution is maintaining running exponential moving averages of mean and variance during training, then using those fixed statistics at inference time (model.eval() in PyTorch switches to this mode). The subtle problem is that the running statistics might not match the true data distribution if training data is not shuffled properly, if batch size changes between training and deployment, or if the data distribution shifts in production. I have seen a production model lose 5% accuracy simply because batch statistics diverged from the running averages due to a data pipeline change that altered the feature distribution. This is one reason why LayerNorm (which normalizes per-sample, not per-batch) is preferred in transformers: it has no train/inference discrepancy.