Gradients & Multivariable Calculus

Before You Begin: Make sure you’re comfortable with derivatives from the previous module. Gradients are just “derivatives, but more of them.” If you’re shaky on the basics, review Module 1 first!

Why This Module Matters for ML

| ML Concept | How Gradients Are Used |
| --- | --- |
| Training neural networks | Gradient descent adjusts millions of weights simultaneously |
| Backpropagation | Computing gradients through layers of functions |
| Optimization | Finding the minimum of a loss function with many parameters |
| Learning rate | Scales the gradient to control step size |
The Big Picture: Neural networks have millions of parameters. Gradients tell us how to adjust ALL of them at once, in the right proportions.

Your Challenge: The CEO’s Dilemma

In the previous module, you optimized one thing (price). But in the real world, you rarely control just one variable. Imagine you’re the CEO of a tech startup. You have two powerful levers to pull:
  1. Price ($x$): How much you charge
  2. Ad Spend ($y$): How much you spend on marketing
Your Goal: Maximize Profit. The problem is, these variables interact!
  • High price + Low ads = No sales
  • Low price + High ads = Lots of sales, but high costs
  • High price + High ads = Premium brand? Or wasted money?
You are standing on a complex “Profit Landscape” with hills and valleys. You want to find the highest peak (maximum profit). The Catch: You’re blindfolded (or in a thick fog). You can’t see the peak. You can only feel the slope under your feet.

The Hiker in the Fog

[Figure: Hiker in the Fog, gradient intuition]
This is the classic intuition for Gradients. Imagine you’re hiking up a mountain in dense fog:
  1. You can’t see the summit.
  2. You want to go up as fast as possible.
  3. What do you do?
You feel the ground with your foot:
  • Step East ($x$): Is it going up or down? (Partial derivative w.r.t. $x$)
  • Step North ($y$): Is it going up or down? (Partial derivative w.r.t. $y$)
If East is steep uphill, and North is slightly uphill, you move mostly East, slightly North. The Gradient is your Compass. It combines these two slopes into ONE arrow that points steepest uphill.

What Is a Gradient?

Intuitive Definition

Gradient = The Direction of Steepest Ascent. It answers: “Which combination of changes ($x$ and $y$) will increase my output the fastest?”

Mathematical Definition

The gradient (symbol $\nabla$, pronounced “del” or “nabla”) is just a vector holding all the partial derivatives:
$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix}$$
  • Top number: Slope in the $x$ direction (Price): “if I change only price, how does profit change?”
  • Bottom number: Slope in the $y$ direction (Ad Spend): “if I change only ads, how does profit change?”
Breaking Down Partial Derivatives

The Key Insight: One Variable at a Time

A partial derivative is just a regular derivative, but you pretend all other variables are constants.
| Derivative Type | What It Means | Example |
| --- | --- | --- |
| $\frac{df}{dx}$ | Regular derivative (1 variable) | $f(x) = x^2$, so $\frac{df}{dx} = 2x$ |
| $\frac{\partial f}{\partial x}$ | Partial derivative (treat $y$ as constant) | $f(x,y) = x^2 + 3y$, so $\frac{\partial f}{\partial x} = 2x$ |
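A quick way to convince yourself that “hold the other variable constant” really works is to nudge one input at a time and measure the change. The sketch below does this for $f(x, y) = x^2 + 3y$; the helper name `partial_derivative`, the step size `h`, and the test point are illustrative assumptions, not part of this module.

```python
def f(x, y):
    return x**2 + 3*y

def partial_derivative(func, point, index, h=1e-6):
    """Central-difference estimate of the partial derivative w.r.t. point[index].

    Only the chosen coordinate is nudged; every other coordinate stays fixed.
    """
    forward = list(point)
    backward = list(point)
    forward[index] += h
    backward[index] -= h
    return (func(*forward) - func(*backward)) / (2 * h)

point = (2.0, 5.0)  # arbitrary test point (assumption)
df_dx = partial_derivative(f, point, index=0)  # analytic value: 2x = 4
df_dy = partial_derivative(f, point, index=1)  # analytic value: 3

print(f"df/dx ≈ {df_dx:.4f}, df/dy ≈ {df_dy:.4f}")
```

The estimates should agree with the analytic partial derivatives $2x$ and $3$ up to small floating-point error.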

Let’s Solve Your CEO Problem

Suppose your Profit function is:
$$P(x, y) = -x^2 - y^2 + 10x + 8y$$
Let’s find the gradient at your current position: Price = 2, Ad Spend = 3.
import numpy as np
import matplotlib.pyplot as plt

def profit(x, y):
    """The profit function - a "mountain" with a peak."""
    return -x**2 - y**2 + 10*x + 8*y

def profit_gradient(x, y):
    """
    Compute the gradient of the profit function.
    
    Step 1: Find ∂P/∂x (treat y as constant)
    P = -x² - y² + 10x + 8y
    ∂P/∂x = -2x + 0 + 10 + 0 = -2x + 10
    
    Step 2: Find ∂P/∂y (treat x as constant)
    ∂P/∂y = 0 - 2y + 0 + 8 = -2y + 8
    """
    dp_dx = -2*x + 10  # How profit changes with price
    dp_dy = -2*y + 8   # How profit changes with ads
    
    return np.array([dp_dx, dp_dy])

# Your current strategy
current_price = 2
current_ad_spend = 3

grad = profit_gradient(current_price, current_ad_spend)
current_profit = profit(current_price, current_ad_spend)

print("=== CEO DASHBOARD ===")
print(f"Current Position: Price=${current_price}, Ads=${current_ad_spend}")
print(f"Current Profit: ${current_profit}")
print()
print(f"Gradient Vector: {grad}")
Output:
=== CEO DASHBOARD ===
Current Position: Price=$2, Ads=$3
Current Profit: $31

Gradient Vector: [6 2]
What this tells you:
  • 6 (x-component): Increasing Price is VERY profitable right now.
  • 2 (y-component): Increasing Ad Spend is MILDLY profitable.
  • Decision: You should increase BOTH, but focus 3x more effort on raising Price!
Key Insight: The gradient doesn’t just tell you “up” - it tells you the exact mix of changes to make.
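To see the “exact mix” in action, here is a minimal sketch that repeatedly takes a small step along the gradient from the starting strategy (Price = 2, Ad Spend = 3). The step size of 0.1 and the 50 iterations are assumed values chosen for illustration. Setting both partial derivatives to zero gives the peak at $(5, 4)$, so the walk should end up there.

```python
import numpy as np

def profit(x, y):
    return -x**2 - y**2 + 10*x + 8*y

def profit_gradient(x, y):
    return np.array([-2*x + 10, -2*y + 8])

position = np.array([2.0, 3.0])  # Price = 2, Ad Spend = 3
step_size = 0.1                  # assumed value, small enough to avoid overshooting

start_profit = profit(*position)
for _ in range(50):
    # Gradient ASCENT: move in the direction the gradient points (uphill)
    position = position + step_size * profit_gradient(*position)

print(f"Start profit: {start_profit}")
print(f"Position after 50 steps: {position.round(3)}")
print(f"Profit there: {profit(*position):.3f}")
```

Each step moves price three times as far as ad spend at first (gradient $[6, 2]$), and the steps shrink as the slope flattens near the peak.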

Partial Derivatives: Step-by-Step Guide

Before diving into examples, let’s master the technique of computing partial derivatives.

The Key Rule

To find $\frac{\partial f}{\partial x}$: Treat ALL other variables as constants, then differentiate with respect to $x$.

Example 1: Basic Polynomial

$$f(x, y) = 3x^2 + 2xy + y^3$$
Find $\frac{\partial f}{\partial x}$ (treat $y$ as constant):
$$\frac{\partial f}{\partial x} = 6x + 2y + 0 = 6x + 2y$$
Find $\frac{\partial f}{\partial y}$ (treat $x$ as constant):
$$\frac{\partial f}{\partial y} = 0 + 2x + 3y^2 = 2x + 3y^2$$
The Gradient:
$$\nabla f = \begin{bmatrix}6x + 2y \\ 2x + 3y^2\end{bmatrix}$$
def f(x, y):
    return 3*x**2 + 2*x*y + y**3

def gradient_f(x, y):
    df_dx = 6*x + 2*y
    df_dy = 2*x + 3*y**2
    return np.array([df_dx, df_dy])

# At point (1, 2)
print(gradient_f(1, 2))  # [10, 14]

Example 2: Mixed Terms

$$g(x, y) = x^2 y^3 + e^{xy}$$
Find $\frac{\partial g}{\partial x}$:
  • For $x^2 y^3$: treat $y^3$ as constant → $2xy^3$
  • For $e^{xy}$: chain rule with $u = xy$ → $e^{xy} \cdot y$
$$\frac{\partial g}{\partial x} = 2xy^3 + ye^{xy}$$
Find $\frac{\partial g}{\partial y}$:
  • For $x^2 y^3$: treat $x^2$ as constant → $3x^2y^2$
  • For $e^{xy}$: chain rule with $u = xy$ → $e^{xy} \cdot x$
$$\frac{\partial g}{\partial y} = 3x^2y^2 + xe^{xy}$$
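Hand-derived gradients with chain-rule terms are easy to get wrong, so it helps to check them numerically. The following sketch compares the analytic gradient of $g$ against a central-difference estimate; the helper `numerical_grad`, the step size, and the test point are assumptions made for illustration.

```python
import numpy as np

def g(x, y):
    return x**2 * y**3 + np.exp(x * y)

def grad_g(x, y):
    """Analytic gradient derived above."""
    dg_dx = 2*x*y**3 + y*np.exp(x*y)
    dg_dy = 3*x**2*y**2 + x*np.exp(x*y)
    return np.array([dg_dx, dg_dy])

def numerical_grad(func, x, y, h=1e-6):
    """Central-difference estimate: nudge one variable, hold the other fixed."""
    d_dx = (func(x + h, y) - func(x - h, y)) / (2*h)
    d_dy = (func(x, y + h) - func(x, y - h)) / (2*h)
    return np.array([d_dx, d_dy])

x0, y0 = 0.5, 1.2  # arbitrary test point (assumption)
analytic = grad_g(x0, y0)
numeric = numerical_grad(g, x0, y0)

print("analytic:", analytic)
print("numeric: ", numeric)
print("match:", np.allclose(analytic, numeric, rtol=1e-5))
```

If the two disagree, the usual culprit is a forgotten inner derivative in the chain rule.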

Example 3: Common ML Functions

Mean Squared Error:
$$L(w, b) = \frac{1}{n}\sum_{i=1}^{n}(wx_i + b - y_i)^2$$
Find $\frac{\partial L}{\partial w}$:
$$\frac{\partial L}{\partial w} = \frac{2}{n}\sum_{i=1}^{n}(wx_i + b - y_i) \cdot x_i$$
Find $\frac{\partial L}{\partial b}$:
$$\frac{\partial L}{\partial b} = \frac{2}{n}\sum_{i=1}^{n}(wx_i + b - y_i)$$
def mse_gradients(w, b, x, y):
    """Compute gradients of MSE loss"""
    n = len(x)
    predictions = w * x + b
    errors = predictions - y
    
    dL_dw = (2/n) * np.sum(errors * x)
    dL_db = (2/n) * np.sum(errors)
    
    return np.array([dL_dw, dL_db])

# Example: fitting a line
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
w, b = 0.5, 1.0

grad = mse_gradients(w, b, x, y)
print(f"∂L/∂w = {grad[0]:.3f}")
print(f"∂L/∂b = {grad[1]:.3f}")
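Once you have these two partial derivatives, fitting the line is just a loop. The sketch below runs plain gradient descent on the same five points and compares the result with NumPy's least-squares fit (`np.polyfit`). The learning rate of 0.02 and the 5000 iterations are assumed values that happen to be stable for this small dataset.

```python
import numpy as np

def mse_gradients(w, b, x, y):
    """Gradient of the MSE loss with respect to (w, b)."""
    n = len(x)
    errors = (w * x + b) - y
    return np.array([(2/n) * np.sum(errors * x), (2/n) * np.sum(errors)])

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

w, b = 0.5, 1.0
learning_rate = 0.02  # assumed; too large a value makes the loop diverge

for _ in range(5000):
    dL_dw, dL_db = mse_gradients(w, b, x, y)
    w -= learning_rate * dL_dw  # step opposite the gradient to reduce the loss
    b -= learning_rate * dL_db

slope, intercept = np.polyfit(x, y, 1)  # closed-form least-squares line
print(f"Gradient descent: w={w:.4f}, b={b:.4f}")
print(f"np.polyfit:       w={slope:.4f}, b={intercept:.4f}")
```

Both approaches should land on the same line, which is a useful check that the gradient formulas are right.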

Partial Derivative Rules Summary

| Function Type | $\frac{\partial}{\partial x}$ | Example |
| --- | --- | --- |
| $x^n$ | $nx^{n-1}$ | $\frac{\partial}{\partial x}x^3 = 3x^2$ |
| $y$ (constant) | $0$ | $\frac{\partial}{\partial x}y^2 = 0$ |
| $xy$ | $y$ | $\frac{\partial}{\partial x}xy = y$ |
| $x^m y^n$ | $mx^{m-1}y^n$ | $\frac{\partial}{\partial x}x^2y^3 = 2xy^3$ |
| $e^{xy}$ | $ye^{xy}$ | Chain rule: $(e^u)' \cdot \frac{\partial u}{\partial x}$ |
| $\ln(xy)$ | $\frac{1}{x}$ | $\frac{1}{xy} \cdot y = \frac{1}{x}$ |
| $\sin(xy)$ | $y\cos(xy)$ | Chain rule |

Example 1: Optimizing Your Business

The Problem

Let’s formalize your CEO problem. You want to maximize Revenue based on two investments:
  • $x$ = Advertising Budget (\$1000s)
  • $y$ = Product Quality Investment (\$1000s)
Your Revenue Function:
$$R(x, y) = 100x + 80y - x^2 - y^2 - 0.5xy$$
Your Goal: Find the perfect budget allocation $(x, y)$ that maximizes $R$.

Visualizing Your Landscape

Here is what your revenue landscape looks like. The gradient (red arrow) shows you the fastest way to the top.
[Figure: Steepest Ascent Visual]

Step 1: Compute the Gradient

You need to find the partial derivatives (the slope in each direction):
  1. Slope w.r.t. Ad Budget ($x$): $\frac{\partial R}{\partial x} = 100 - 2x - 0.5y$ (treat $y$ as a constant)
  2. Slope w.r.t. Quality ($y$): $\frac{\partial R}{\partial y} = 80 - 2y - 0.5x$ (treat $x$ as a constant)
Your Gradient Vector:
$$\nabla R = \begin{bmatrix} 100 - 2x - 0.5y \\ 80 - 2y - 0.5x \end{bmatrix}$$

Step 2: Check Your Current Strategy

Suppose you are currently spending:
  • $x = 20$ (\$20k on Ads)
  • $y = 15$ (\$15k on Quality)
Let’s plug these into your gradient:
import numpy as np

def revenue_gradient(x, y):
    dR_dx = 100 - 2*x - 0.5*y
    dR_dy = 80 - 2*y - 0.5*x
    return np.array([dR_dx, dR_dy])

# Your current allocation
x, y = 20, 15
grad = revenue_gradient(x, y)

print(f"Current allocation: Ad=${x}k, Quality=${y}k")
print(f"Gradient: {grad}")
Output:
Current allocation: Ad=$20k, Quality=$15k
Gradient: [52.5 40. ]
Interpretation:
  • 52.5: Increasing Ad spend raises revenue strongly right now.
  • 40.0: Increasing Quality also raises revenue, but somewhat less.
  • Action: Increase both, but prioritize Ads slightly more.

Step 3: Find the Optimal Allocation

To find the absolute peak, you want the point where the slope is ZERO in all directions (flat top). Set the gradient to 0:
$$\begin{cases} 100 - 2x - 0.5y = 0 \\ 80 - 2y - 0.5x = 0 \end{cases}$$
Solving this system (using linear algebra or substitution):
# Solve system of equations
# 2x + 0.5y = 100
# 0.5x + 2y = 80

A = np.array([[2, 0.5], [0.5, 2]])
b = np.array([100, 80])
optimal = np.linalg.solve(A, b)

x_opt, y_opt = optimal
print(f"Optimal Ad Budget: ${x_opt:.2f}k")
print(f"Optimal Quality Budget: ${y_opt:.2f}k")
Output:
Optimal Ad Budget: $42.67k
Optimal Quality Budget: $29.33k
Result: You found the perfect strategy! Spend about \$42.7k on ads and \$29.3k on quality to maximize revenue. Real Application: The same idea, balancing several metrics at once (CTR, bid price, user relevance), underlies large-scale ad optimization systems.

Example 2: Optimizing Your Grades

The Problem

You want to maximize your overall GPA across 3 subjects:
  • $x$ = hours/week on Math
  • $y$ = hours/week on English
  • $z$ = hours/week on Science
Your GPA Function:
$$G(x, y, z) = \sqrt{x} + \sqrt{y} + \sqrt{z} - 0.01(x^2 + y^2 + z^2)$$
(Square roots represent learning; squared terms represent burnout/fatigue.) Constraint: You only have 30 hours/week total.

Computing Your Gradient

The gradient tells you: “If I add 1 hour of study, which subject gives the biggest GPA boost?”
def gpa_gradient(x, y, z):
    # Partial derivatives (marginal benefit - marginal cost)
    dG_dx = 0.5/np.sqrt(x) - 0.02*x
    dG_dy = 0.5/np.sqrt(y) - 0.02*y
    dG_dz = 0.5/np.sqrt(z) - 0.02*z
    return np.array([dG_dx, dG_dy, dG_dz])

# Your current schedule
x, y, z = 10, 12, 8  # hours per subject

grad = gpa_gradient(x, y, z)

print(f"Current Schedule: Math={x}h, English={y}h, Science={z}h")
print(f"Gradient: {np.round(grad, 3)}")
Output:
Current Schedule: Math=10h, English=12h, Science=8h
Gradient: [-0.042 -0.096  0.017]
Interpretation:
  • Math (-0.042): Negative! Studying MORE math will actually LOWER your GPA (burnout).
  • English (-0.096): Very Negative! You are over-studying English.
  • Science (+0.017): Positive! You should shift time to Science.
Action: Study less English/Math, study more Science!
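The plain gradient ignores the 30-hour budget stated in the problem. One simple way to respect it, sketched below, is to subtract the average from the gradient so the components sum to zero: any step in that direction moves hours between subjects without changing the total. The step size of 10 and the 200 iterations are assumed values for illustration, and `feasible_direction` is a name introduced here.

```python
import numpy as np

def gpa_gradient(hours):
    """Marginal benefit minus marginal cost for each subject."""
    return 0.5 / np.sqrt(hours) - 0.02 * hours

hours = np.array([10.0, 12.0, 8.0])  # Math, English, Science (sums to 30)
step_size = 10.0                     # assumed; the landscape is very flat, so steps can be large

for _ in range(200):
    grad = gpa_gradient(hours)
    # Remove the part of the gradient that would change the total:
    # the remaining components sum to zero, so hours.sum() stays at 30.
    feasible_direction = grad - grad.mean()
    hours = hours + step_size * feasible_direction

print(f"Schedule: {hours.round(3)}, total = {hours.sum():.1f}h")
```

Because all three subjects share the same learning and burnout curves, the reallocation settles at an even split of 10 hours each, which is where all three marginal values are equal.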

Example 3: Tuning Your Recommendation System

The Problem

You are building a Netflix-style recommender. You have 3 knobs to tune:
  • $\alpha$ = Recency weight (how much recent views matter)
  • $\beta$ = Popularity weight (how much overall hits matter)
  • $\gamma$ = Personalization weight (how much user history matters)
Your Error Function (lower is better):
$$E(\alpha, \beta, \gamma) = (\alpha - 0.6)^2 + (\beta - 0.3)^2 + (\gamma - 0.8)^2 + 0.1\alpha\beta$$
Goal: Find the knob settings that minimize error.

Gradient Descent Optimization

Since we want to MINIMIZE error, we move opposite to the gradient.
def error_gradient(alpha, beta, gamma):
    dE_dalpha = 2*(alpha - 0.6) + 0.1*beta
    dE_dbeta = 2*(beta - 0.3) + 0.1*alpha
    dE_dgamma = 2*(gamma - 0.8)
    return np.array([dE_dalpha, dE_dbeta, dE_dgamma])

# Start from an arbitrary initial guess
params = np.array([0.2, 0.5, 0.4])
learning_rate = 0.1

print("Optimizing your system...")
for step in range(15):
    grad = error_gradient(*params)
    params = params - learning_rate * grad  # Move OPPOSITE to gradient
    
    if step % 5 == 0:
        print(f"Step {step}: α={params[0]:.2f}, β={params[1]:.2f}, γ={params[2]:.2f}")

print(f"Settings after 15 steps: α={params[0]:.2f}, β={params[1]:.2f}, γ={params[2]:.2f}")
Real Application: Production recommendation systems tune thousands of such parameters automatically with variants of this method!
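Because this particular error function is quadratic, its gradient is linear and the exact minimum can also be found by solving $\nabla E = 0$ directly. The sketch below does that with `np.linalg.solve` and checks that a longer gradient descent run (200 steps, an assumed count) reaches the same point. This shortcut is specific to quadratic functions; real loss surfaces rarely allow it.

```python
import numpy as np

def error_gradient(params):
    alpha, beta, gamma = params
    return np.array([
        2*(alpha - 0.6) + 0.1*beta,
        2*(beta - 0.3) + 0.1*alpha,
        2*(gamma - 0.8),
    ])

# Setting each component of the gradient to zero gives a linear system:
#   2α + 0.1β = 1.2
#   0.1α + 2β = 0.6
#   2γ        = 1.6
A = np.array([[2.0, 0.1, 0.0],
              [0.1, 2.0, 0.0],
              [0.0, 0.0, 2.0]])
rhs = np.array([1.2, 0.6, 1.6])
exact = np.linalg.solve(A, rhs)

# Gradient descent from the same starting point, run until it settles
params = np.array([0.2, 0.5, 0.4])
learning_rate = 0.1
for _ in range(200):
    params = params - learning_rate * error_gradient(params)

print(f"Exact minimum:    {exact.round(4)}")
print(f"Gradient descent: {params.round(4)}")
```

Note that the minimum is not at $(0.6, 0.3, 0.8)$: the interaction term $0.1\alpha\beta$ pulls $\alpha$ and $\beta$ slightly lower.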

Directional Derivatives: Choosing Your Path

The Question

The gradient tells you the steepest way up. But what if you can’t go that way? What if you want to go Northeast? The Directional Derivative answers: “How fast will I climb if I walk in THIS specific direction?”
[Figure: Directional Derivative Compass]

The Formula

To find the rate of change in the direction of a unit vector $\mathbf{v}$ (length 1):
$$\text{Rate} = \nabla f \cdot \mathbf{v} \quad (\text{Dot Product})$$
  • If direction is same as gradient → Max rate (Steepest ascent)
  • If direction is perpendicular → Zero rate (Walking flat)
  • If direction is opposite → Negative rate (Steepest descent)
# Example gradient at some position (illustrative values)
grad = np.array([39, 40])

# You want to move Northeast (45 degrees)
direction = np.array([1, 1]) 
direction = direction / np.linalg.norm(direction) # Normalize length to 1

# How fast will you climb?
rate = np.dot(grad, direction)
print(f"Climbing rate in Northeast direction: {rate:.2f}")
Key Insight: The dot product measures “alignment”. The more your direction aligns with the gradient, the faster you climb!
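The three cases listed above (aligned, perpendicular, opposite) can be checked directly. This sketch uses an illustrative gradient $[3, 4]$, whose length is 5, so the expected rates are easy to read off; the helper name `directional_derivative` is an assumption introduced for the example.

```python
import numpy as np

grad = np.array([3.0, 4.0])  # illustrative gradient with length 5

def directional_derivative(gradient, direction):
    """Rate of change along `direction`, which is normalized to length 1 first."""
    unit = direction / np.linalg.norm(direction)
    return float(np.dot(gradient, unit))

along = directional_derivative(grad, grad)                          # same direction as the gradient
perpendicular = directional_derivative(grad, np.array([-4.0, 3.0])) # at 90 degrees to the gradient
against = directional_derivative(grad, -grad)                       # opposite direction

print(f"Along gradient:  {along:.2f}")
print(f"Perpendicular:   {perpendicular:.2f}")
print(f"Against gradient:{against:.2f}")
```

Walking along the gradient gives a rate equal to the gradient's length, and no other unit direction can beat it.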

Hessian Matrix (Second Derivatives)

What Is It?

Matrix of all second partial derivatives:
$$H = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix}$$

Why It Matters

The Hessian tells you about curvature. At a critical point (where the gradient is zero):
  • Positive definite → Local minimum
  • Negative definite → Local maximum
  • Indefinite → Saddle point
# Example: f(x,y) = x² + y²
def hessian_simple():
    return np.array([
        [2, 0],
        [0, 2]
    ])

H = hessian_simple()
eigenvalues = np.linalg.eigvals(H)
print(f"Eigenvalues: {eigenvalues}")
# Both positive → minimum!
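The same eigenvalue test covers all three cases. Below is a minimal sketch of a classifier; the function name `classify_critical_point` and the tolerance are assumptions for illustration. The second example uses the Hessian of the revenue function $R(x, y) = 100x + 80y - x^2 - y^2 - 0.5xy$ from earlier, whose second derivatives are constant.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-12):
    """Classify a critical point from the eigenvalues of its symmetric Hessian."""
    eigenvalues = np.linalg.eigvalsh(np.asarray(hessian, dtype=float))
    if np.all(eigenvalues > tol):
        return "local minimum"
    if np.all(eigenvalues < -tol):
        return "local maximum"
    if eigenvalues.min() < -tol and eigenvalues.max() > tol:
        return "saddle point"
    return "inconclusive (some eigenvalue is zero)"

examples = {
    "x^2 + y^2": [[2, 0], [0, 2]],
    "revenue R(x, y)": [[-2, -0.5], [-0.5, -2]],
    "x^2 - y^2": [[2, 0], [0, -2]],
}

results = {name: classify_critical_point(H) for name, H in examples.items()}
for name, label in results.items():
    print(f"{name}: {label}")
```

The revenue Hessian has eigenvalues $-2.5$ and $-1.5$, which confirms that the zero-gradient point found in Step 3 is a maximum.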

Practice Exercises

Exercise 1: Profit Optimization

# Profit function with 2 variables
def profit(x, y):
    return 50*x + 40*y - x**2 - y**2 - 0.5*x*y

# TODO:
# 1. Compute the gradient
# 2. Find the point where gradient = 0
# 3. Verify it's a maximum using the Hessian
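If you want to check your work afterwards, here is one possible solution sketch. It follows the same three steps as the revenue example: write down the gradient, solve the linear system where it vanishes, and inspect the Hessian's eigenvalues. The names `exercise_gradient`, `x_star`, and `y_star` are introduced here for illustration.

```python
import numpy as np

def exercise_gradient(x, y):
    """Gradient of 50x + 40y - x^2 - y^2 - 0.5xy."""
    return np.array([50 - 2*x - 0.5*y, 40 - 2*y - 0.5*x])

# Gradient = 0 rearranges to:
#   2x + 0.5y = 50
#   0.5x + 2y = 40
A = np.array([[2.0, 0.5], [0.5, 2.0]])
b = np.array([50.0, 40.0])
x_star, y_star = np.linalg.solve(A, b)

# Second derivatives are constant for a quadratic function
hessian = np.array([[-2.0, -0.5], [-0.5, -2.0]])
eigenvalues = np.linalg.eigvalsh(hessian)

print(f"Critical point: x={x_star:.3f}, y={y_star:.3f}")
print(f"Gradient there: {exercise_gradient(x_star, y_star)}")
print(f"Hessian eigenvalues: {eigenvalues}")  # all negative means a maximum
```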

🎯 Practice Exercises & Real-World Applications

Challenge yourself! These exercises show how gradients guide decisions in business, ML, and everyday life.

Exercise 1: Marketing Budget Allocation 📊

A company has a marketing budget to split between Google Ads and Instagram:
import numpy as np

# Conversion rate depends on both channels (they interact!)
# Conversions(g, i) = 100*sqrt(g) + 80*sqrt(i) + 10*sqrt(g*i)
# where g = Google spend ($000s), i = Instagram spend ($000s)
#
# Total budget: $50,000 (g + i = 50)
# Revenue per conversion: $50

# TODO:
# 1. Write the profit function (revenue - costs)
# 2. Compute the gradient ∇Profit
# 3. Find the optimal allocation
# 4. What's the gradient at g=25, i=25? What does it tell you?
Solution:
import numpy as np

def conversions(g, i):
    """Total conversions from both channels"""
    return 100*np.sqrt(g) + 80*np.sqrt(i) + 10*np.sqrt(g*i)

def profit(g, i, revenue_per_conv=50):
    """Profit = Revenue - Costs"""
    return revenue_per_conv * conversions(g, i) - (g + i) * 1000

def gradient(g, i, revenue_per_conv=50):
    """
    ∂P/∂g = 50 * (50/√g + 5*√(i/g)) - 1000
    ∂P/∂i = 50 * (40/√i + 5*√(g/i)) - 1000
    """
    dP_dg = revenue_per_conv * (50/np.sqrt(g) + 5*np.sqrt(i/g)) - 1000
    dP_di = revenue_per_conv * (40/np.sqrt(i) + 5*np.sqrt(g/i)) - 1000
    return np.array([dP_dg, dP_di])

print("📊 Marketing Budget Optimization")
print("=" * 55)

# Current equal split
g_curr, i_curr = 25, 25
grad = gradient(g_curr, i_curr)
print(f"\n📍 Current Split: Google=${g_curr}k, Instagram=${i_curr}k")
print(f"   Conversions: {conversions(g_curr, i_curr):.0f}")
print(f"   Profit: ${profit(g_curr, i_curr):,.0f}")
print(f"   Gradient: [{grad[0]:.2f}, {grad[1]:.2f}]")
print(f"\n   💡 Interpretation:")
print(f"   • Google marginal value: ${grad[0]:.0f} per $1k extra")
print(f"   • Instagram marginal value: ${grad[1]:.0f} per $1k extra")
if grad[0] > grad[1]:
    print(f"   → Shift budget TO Google!")
else:
    print(f"   → Shift budget TO Instagram!")

# Gradient ascent along the constraint g + i = 50
def optimize_constrained():
    g = 25.0
    lr = 0.5
    for _ in range(2000):  # the effective step (lr / 2000) is small, so iterate until converged
        grad = gradient(g, 50-g)
        # Move budget toward the channel with the higher marginal value
        g = g + lr * (grad[0] - grad[1]) / 2000
        g = np.clip(g, 1, 49)  # Keep valid
    return g, 50-g

g_opt, i_opt = optimize_constrained()
print(f"\n🎯 Optimal Allocation:")
print(f"   Google: ${g_opt:.1f}k ({g_opt/50*100:.0f}%)")
print(f"   Instagram: ${i_opt:.1f}k ({i_opt/50*100:.0f}%)")
print(f"   Profit: ${profit(g_opt, i_opt):,.0f}")
print(f"\n📈 Improvement: +${profit(g_opt, i_opt) - profit(25, 25):,.0f} vs equal split")
Real-World Insight: Performance marketing teams reason about ad spend in much the same way. The gradient tells you where your next dollar is most valuable!

Exercise 2: Neural Network Weight Update 🧠

Manually compute a gradient update for a tiny neural network:
import numpy as np

# Simple network: 2 inputs → 2 weights → 1 output
# y = w1*x1 + w2*x2
# Loss = (y - target)²

# Data point: x = [3, 4], target = 10
# Current weights: w = [1, 1]
# Predicted: y = 1*3 + 1*4 = 7
# Loss = (7 - 10)² = 9

# TODO:
# 1. Compute ∂Loss/∂w1 and ∂Loss/∂w2
# 2. Update weights with learning rate 0.01 (0.1 overshoots and diverges for this input)
# 3. Compute new prediction and loss
# 4. Repeat for 5 steps and watch loss decrease
Solution:
import numpy as np

def predict(w, x):
    return w[0]*x[0] + w[1]*x[1]

def loss(y_pred, y_true):
    return (y_pred - y_true) ** 2

def gradient(w, x, y_true):
    """
    L = (w1*x1 + w2*x2 - target)²
    ∂L/∂w1 = 2(y_pred - target) * x1
    ∂L/∂w2 = 2(y_pred - target) * x2
    """
    y_pred = predict(w, x)
    error = y_pred - y_true
    return np.array([2 * error * x[0], 2 * error * x[1]])

print("🧠 Neural Network Gradient Descent")
print("=" * 55)

# Setup
x = np.array([3, 4])
target = 10
w = np.array([1.0, 1.0])
lr = 0.01  # must stay below 1/(x1² + x2²) = 0.04 here; with 0.1 the updates overshoot and the loss grows

print(f"Data: x = {x}, target = {target}")
print(f"Initial weights: w = {w}")
print(f"Learning rate: {lr}")
print("\n" + "-" * 55)
print(f"{'Step':<6} {'Weights':<20} {'Pred':<8} {'Loss':<10} {'Gradient'}")
print("-" * 55)

for step in range(6):
    y_pred = predict(w, x)
    L = loss(y_pred, target)
    grad = gradient(w, x, target)
    
    print(f"{step:<6} [{w[0]:.3f}, {w[1]:.3f}]     {y_pred:<8.2f} {L:<10.4f} [{grad[0]:.2f}, {grad[1]:.2f}]")
    
    # Update weights
    w = w - lr * grad

print("-" * 55)
print(f"\n✅ Final weights: w = [{w[0]:.3f}, {w[1]:.3f}]")
print(f"   Prediction: {predict(w, x):.4f} (target was {target})")
print(f"   Loss reduced from 9.0 to {loss(predict(w, x), target):.6f}")

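# Note: any w with 3*w1 + 4*w2 = 10 gives zero loss; [1, 1.75] is just one such solution.
# Starting from [1, 1], gradient descent only moves along x = [3, 4] and approaches [1.36, 1.48].
print(f"\n💡 One exact solution: [1.0, 1.75] → {1*3 + 1.75*4}")
Real-World Insight: This update rule is the core of gradient-based neural network training. Frameworks such as PyTorch, TensorFlow, and JAX compute the gradients automatically and apply refinements of this rule across millions of weights.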

Exercise 3: Heat Map Navigation 🗺️

You’re a robot navigating a temperature field. Find the hottest spot:
import numpy as np

# Temperature field (2D Gaussian peaks)
# T(x, y) = 80*exp(-((x-3)² + (y-2)²)/10) + 60*exp(-((x+2)² + (y+1)²)/5)
# Two heat sources: one at (3, 2), another at (-2, -1)

# You start at position (0, 0)
# Use gradient ascent to find the hottest spot

# TODO:
# 1. Compute the gradient of T
# 2. Implement gradient ascent
# 3. Which heat source do you reach?
# 4. Try different starting positions - do you reach different peaks?
Solution:
import numpy as np

def temperature(x, y):
    """Two Gaussian heat sources"""
    peak1 = 80 * np.exp(-((x-3)**2 + (y-2)**2) / 10)  # Peak at (3, 2), max=80
    peak2 = 60 * np.exp(-((x+2)**2 + (y+1)**2) / 5)   # Peak at (-2, -1), max=60
    return peak1 + peak2

def gradient_T(x, y):
    """Gradient of temperature field"""
    # For peak1: 80*exp(-((x-3)² + (y-2)²)/10)
    # ∂/∂x = 80 * exp(...) * (-2(x-3)/10) = peak1 * (-(x-3)/5)
    peak1 = 80 * np.exp(-((x-3)**2 + (y-2)**2) / 10)
    peak2 = 60 * np.exp(-((x+2)**2 + (y+1)**2) / 5)
    
    dT_dx = peak1 * (-(x-3) / 5) + peak2 * (-(x+2) / 2.5)
    dT_dy = peak1 * (-(y-2) / 5) + peak2 * (-(y+1) / 2.5)
    
    return np.array([dT_dx, dT_dy])

def gradient_ascent(start_x, start_y, lr=0.05, steps=200):
    """Climb the temperature gradient (lr above about 0.08 overshoots the narrow peak and never settles)"""
    x, y = start_x, start_y
    path = [(x, y, temperature(x, y))]
    
    for _ in range(steps):
        grad = gradient_T(x, y)
        x = x + lr * grad[0]
        y = y + lr * grad[1]
        path.append((x, y, temperature(x, y)))
        
        if np.linalg.norm(grad) < 0.01:
            break
    
    return x, y, path

print("🗺️ Heat Map Navigation (Gradient Ascent)")
print("=" * 55)
print("Heat sources: Peak1 at (3, 2) = 80°C, Peak2 at (-2, -1) = 60°C")

# Test different starting positions
starts = [(0, 0), (5, 0), (-3, 0), (0, 3), (0, -3)]

print("\n📍 Starting Position → Final Position → Peak Reached")
print("-" * 55)

for sx, sy in starts:
    fx, fy, path = gradient_ascent(sx, sy)
    final_temp = temperature(fx, fy)
    
    # Determine which peak
    if fx > 0:
        peak = "Peak1 (80°C)"
    else:
        peak = "Peak2 (60°C)"
    
    print(f"   ({sx:3}, {sy:3}) → ({fx:.1f}, {fy:.1f}) → {peak} at {final_temp:.1f}°C")

# Detailed path from origin
print("\n🚶 Detailed Path from (0, 0):")
_, _, path = gradient_ascent(0, 0)
print("   Step | Position    | Temperature | Gradient")
print("   -----|-------------|-------------|----------")
for i in [0, 5, 10, 20, len(path)-1]:
    if i < len(path):
        x, y, t = path[i]
        g = gradient_T(x, y)
        print(f"   {i:4} | ({x:4.1f}, {y:4.1f}) | {t:11.2f} | ({g[0]:5.2f}, {g[1]:5.2f})")

print("\n💡 Key Insight:")
print("   Gradient ascent finds LOCAL maxima - you reach")
print("   whichever peak you're closest to initially!")
print("   This is why neural networks can get stuck in local minima!")
Real-World Insight: This local vs global optimum problem is fundamental in ML. It’s why we use random initialization, momentum, and techniques like simulated annealing to escape local optima!

Exercise 4: Portfolio Optimization 💼

Find the optimal stock allocation to maximize risk-adjusted return:
import numpy as np

# Two stocks: A (high risk/return) and B (low risk/return)
# Expected return: R(a, b) = 0.15*a + 0.08*b (a, b are allocation fractions)
# Variance (risk): V(a, b) = 0.04*a² + 0.01*b² + 0.01*a*b
# 
# Sharpe ratio (risk-adjusted return): S = R / sqrt(V)
# Constraint: a + b = 1 (fully invested)

# TODO:
# 1. Express S in terms of a only (since b = 1 - a)
# 2. Find the gradient ∂S/∂a
# 3. Find optimal allocation
# 4. Compare with 50/50 split
Solution:
import numpy as np

def returns(a):
    """Expected return: R = 0.15*a + 0.08*(1-a)"""
    b = 1 - a
    return 0.15 * a + 0.08 * b

def variance(a):
    """Portfolio variance"""
    b = 1 - a
    return 0.04 * a**2 + 0.01 * b**2 + 0.01 * a * b

def sharpe(a):
    """Sharpe ratio = Return / Risk"""
    return returns(a) / np.sqrt(variance(a))

def sharpe_gradient(a, eps=1e-6):
    """Numerical gradient for Sharpe ratio"""
    return (sharpe(a + eps) - sharpe(a - eps)) / (2 * eps)

print("💼 Portfolio Optimization")
print("=" * 55)
print("Stock A: 15% return, 20% volatility (high risk)")
print("Stock B: 8% return, 10% volatility (low risk)")
print("Correlation: 0.25")  # cross term 0.01*a*b = 2 * 0.25 * 0.20 * 0.10 * a * b

# Gradient ascent to find optimal allocation
a = 0.5  # Start at 50/50
lr = 0.5
history = [(a, sharpe(a))]

for _ in range(50):
    grad = sharpe_gradient(a)
    a_new = a + lr * grad
    a = np.clip(a_new, 0, 1)  # Keep valid allocation
    history.append((a, sharpe(a)))
    if abs(grad) < 1e-6:
        break

optimal_a = a

print(f"\n🎯 Optimal Allocation:")
print(f"   Stock A (high risk): {optimal_a*100:.1f}%")
print(f"   Stock B (low risk): {(1-optimal_a)*100:.1f}%")
print(f"   Expected Return: {returns(optimal_a)*100:.2f}%")
print(f"   Portfolio Risk: {np.sqrt(variance(optimal_a))*100:.2f}%")
print(f"   Sharpe Ratio: {sharpe(optimal_a):.4f}")

# Comparison table
print("\n📊 Allocation Comparison:")
print("   Allocation | Return | Risk   | Sharpe")
print("   ----------|--------|--------|--------")
for alloc, label in [(0, "100% B"), (0.5, "50/50"), (optimal_a, "Optimal"), (1, "100% A")]:
    r = returns(alloc)
    v = np.sqrt(variance(alloc))
    s = sharpe(alloc)
    marker = " ←" if abs(alloc - optimal_a) < 0.01 else ""
    print(f"   {label:9} | {r*100:5.1f}% | {v*100:5.1f}% | {s:.4f}{marker}")

print(f"\n💡 Key Insight:")
print(f"   The gradient told us to shift from 50/50 toward MORE Stock B")
print(f"   (about 31% A / 69% B), but not 100% B - the mix beats either stock alone!")
Real-World Insight: This is the idea behind Modern Portfolio Theory (Markowitz, Nobel Prize 1990). Robo-advisors such as Wealthfront and Betterment build on this kind of portfolio optimization.

🎯 Practice Problems: Test Your Understanding

Before moving on, make sure you can solve these problems. They’re ordered by difficulty.
Given: $f(x, y) = 3x^2 + 4xy - y^2 + 5$
Find:
  1. $\frac{\partial f}{\partial x}$
  2. $\frac{\partial f}{\partial y}$
  3. $\nabla f$ at point $(1, 2)$
Solution:
Step 1: Find $\frac{\partial f}{\partial x}$ (treat $y$ as constant):
$$\frac{\partial f}{\partial x} = 6x + 4y + 0 + 0 = 6x + 4y$$
Step 2: Find $\frac{\partial f}{\partial y}$ (treat $x$ as constant):
$$\frac{\partial f}{\partial y} = 0 + 4x - 2y + 0 = 4x - 2y$$
Step 3: Evaluate at $(1, 2)$:
$$\nabla f(1, 2) = \begin{bmatrix}6(1) + 4(2) \\ 4(1) - 2(2)\end{bmatrix} = \begin{bmatrix}14 \\ 0\end{bmatrix}$$
Interpretation: At point (1, 2), the function increases fastest in the x-direction. The zero in y means changing y alone (at this point) doesn’t change f at first order.

Given: $g(x, y) = x^2 e^y$
Find: $\nabla g$
Solution:
Find $\frac{\partial g}{\partial x}$: Treat $e^y$ as a constant:
$$\frac{\partial g}{\partial x} = 2x \cdot e^y = 2xe^y$$
Find $\frac{\partial g}{\partial y}$: Treat $x^2$ as a constant:
$$\frac{\partial g}{\partial y} = x^2 \cdot e^y = x^2e^y$$
The Gradient:
$$\nabla g = \begin{bmatrix}2xe^y \\ x^2e^y\end{bmatrix} = e^y\begin{bmatrix}2x \\ x^2\end{bmatrix}$$

Given: $h(x, y) = -x^2 - 2y^2 + 4x + 8y - 10$
Find: The point $(x^*, y^*)$ where $\nabla h = 0$ (the critical point).
Solution:
Step 1: Compute the gradient:
$$\frac{\partial h}{\partial x} = -2x + 4, \qquad \frac{\partial h}{\partial y} = -4y + 8$$
Step 2: Set each component to zero:
$$-2x + 4 = 0 \implies x = 2, \qquad -4y + 8 = 0 \implies y = 2$$
The critical point is $(2, 2)$.
Step 3: Verify it’s a maximum (Hessian check):
$$H = \begin{bmatrix}-2 & 0 \\ 0 & -4\end{bmatrix}$$
Both eigenvalues are negative, so this is indeed a maximum!
Value at maximum: $h(2, 2) = -4 - 8 + 8 + 16 - 10 = 2$

Given: The MSE loss for linear regression with 3 data points:
  • $(x_1, y_1) = (1, 3)$
  • $(x_2, y_2) = (2, 5)$
  • $(x_3, y_3) = (3, 7)$
Model: $\hat{y} = wx + b$
Loss: $L(w, b) = \frac{1}{3}\sum_{i=1}^{3}(\hat{y}_i - y_i)^2$
Find: $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$ at $w=1, b=1$.
Solution:
Step 1: Compute predictions at $w=1, b=1$:
  • $\hat{y}_1 = 1(1) + 1 = 2$ (actual: 3, error: -1)
  • $\hat{y}_2 = 1(2) + 1 = 3$ (actual: 5, error: -2)
  • $\hat{y}_3 = 1(3) + 1 = 4$ (actual: 7, error: -3)
Step 2: Loss formula expanded:
$$L = \frac{1}{3}\left[(w \cdot 1 + b - 3)^2 + (w \cdot 2 + b - 5)^2 + (w \cdot 3 + b - 7)^2\right]$$
Step 3: Gradient formulas:
$$\frac{\partial L}{\partial w} = \frac{2}{3}\sum_{i=1}^{3}(wx_i + b - y_i) \cdot x_i, \qquad \frac{\partial L}{\partial b} = \frac{2}{3}\sum_{i=1}^{3}(wx_i + b - y_i)$$
Step 4: Evaluate:
$$\frac{\partial L}{\partial w} = \frac{2}{3}[(-1)(1) + (-2)(2) + (-3)(3)] = \frac{2}{3}(-1 - 4 - 9) = \frac{2}{3}(-14) = -\frac{28}{3} \approx -9.33$$
$$\frac{\partial L}{\partial b} = \frac{2}{3}[(-1) + (-2) + (-3)] = \frac{2}{3}(-6) = -4$$
Interpretation: Both gradients are negative, meaning increasing $w$ and $b$ will DECREASE the loss (which is what we want!). The true optimal values are $w=2, b=1$.
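The arithmetic in the regression problem is easy to confirm in a few lines. This sketch evaluates the MSE gradient at $w=1, b=1$ and again at $w=2, b=1$, where it should vanish because the line $y = 2x + 1$ passes through all three points. The function name `mse_gradient` is an assumption for this example.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])

def mse_gradient(w, b):
    """Gradient of the mean squared error for the model y_hat = w*x + b."""
    errors = w * x + b - y
    return np.array([2 * np.mean(errors * x), 2 * np.mean(errors)])

grad_start = mse_gradient(1.0, 1.0)    # expected: [-28/3, -4]
grad_optimum = mse_gradient(2.0, 1.0)  # expected: [0, 0]

print(f"Gradient at (w=1, b=1): {grad_start.round(3)}")
print(f"Gradient at (w=2, b=1): {grad_optimum.round(3)}")
```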

🔑 Key Takeaways

Gradient Essentials:
  • Gradient - Vector of all partial derivatives; ∇f = [∂f/∂x₁, ∂f/∂x₂, …]
  • Direction - Points toward steepest ascent; negate for descent
  • Magnitude - Tells you how steep the slope is at that point
  • Optimization - Critical points where ∇f = 0
  • Hessian - Second derivatives tell if it’s min (positive definite), max, or saddle

Interview Prep: Gradient Questions

Q: What does the gradient represent geometrically?
The gradient points in the direction of steepest increase. Its magnitude indicates how steep that ascent is. For a loss function, we move in the opposite direction (−∇f) to find the minimum.
Q: Why can’t we just set the gradient to zero and solve for neural networks?
Neural networks have millions of parameters with highly non-linear, non-convex loss surfaces. There’s no closed-form solution. We must use iterative gradient descent to find good (local) minima.
Q: What’s the Hessian and when is it useful?
The Hessian is the matrix of second partial derivatives. It tells us about curvature: positive definite = minimum, negative definite = maximum, indefinite = saddle point. Second-order methods use it for faster convergence but are expensive.

Common Pitfalls

Gradient Mistakes to Avoid:
  1. Sign Confusion - Gradient points uphill; for minimization, move in the OPPOSITE direction
  2. Ignoring Multiple Variables - Partial derivatives hold others constant; the gradient combines them all
  3. Assuming Global Optimum - Non-convex functions can have many local minima and saddle points; a zero gradient doesn’t guarantee the global best
  4. Dimension Mismatch - For f : ℝⁿ → ℝ, the gradient has the same dimension as the input; ∇f : ℝⁿ → ℝⁿ

What’s Next?

You now understand gradients for multi-variable functions. But how do we handle COMPOSITIONS of functions (like neural networks with many layers)? That’s the chain rule - and it’s the key to backpropagation!

Next: Chain Rule & Backpropagation

Discover how neural networks learn through backpropagation