Gradients & Multivariable Calculus

Before You Begin: Make sure you’re comfortable with derivatives from the previous module. Gradients are just “derivatives, but more of them.” If you’re shaky on the basics, review Module 1 first!

Why This Module Matters for ML

ML Concept	How Gradients Are Used
Training neural networks	Gradient descent adjusts millions of weights simultaneously
Backpropagation	Computing gradients through layers of functions
Optimization	Finding the minimum of a loss function with many parameters
Learning rate	Scales the gradient to control step size

The Big Picture: Neural networks have millions of parameters. Gradients tell us how to adjust ALL of them at once, in the right proportions.

Your Challenge: The CEO’s Dilemma

In the previous module, you optimized one thing (price). But in the real world, you rarely control just one variable. Imagine you’re the CEO of a tech startup. You have two powerful levers to pull:

Price ( $x$ ): How much you charge
Ad Spend ( $y$ ): How much you spend on marketing

Your Goal: Maximize Profit. The problem is, these variables interact!

High price + Low ads = No sales
Low price + High ads = Lots of sales, but high costs
High price + High ads = Premium brand? Or wasted money?

You are standing on a complex “Profit Landscape” with hills and valleys. You want to find the highest peak (maximum profit). The Catch: You’re blindfolded (or in a thick fog). You can’t see the peak. You can only feel the slope under your feet.

The Hiker in the Fog

This is the classic intuition for Gradients. Imagine you’re hiking up a mountain in dense fog:

You can’t see the summit.
You want to go up as fast as possible.
What do you do?

You feel the ground with your foot:

Step East ( $x$ ): Is it going up or down? (Partial Derivative w.r.t $x$ )
Step North ( $y$ ): Is it going up or down? (Partial Derivative w.r.t $y$ )

If East is steep uphill, and North is slightly uphill, you move mostly East, slightly North. The Gradient is your Compass. It combines these two slopes into ONE arrow that points steepest uphill.
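The foot-testing idea can be sketched in a few lines of code. The hill function `height` below is a made-up example (it is not the profit function used later), and the test-step size `h` is an arbitrary small number, so treat this as an illustration of the idea rather than a finished routine.

```python
import numpy as np

def height(x, y):
    """Hypothetical hill with its summit at (3, 1)."""
    return -(x - 3)**2 - 2 * (y - 1)**2

def feel_slopes(f, x, y, h=1e-5):
    """Estimate the slope East (x) and North (y) by taking tiny test steps."""
    slope_east = (f(x + h, y) - f(x - h, y)) / (2 * h)
    slope_north = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return np.array([slope_east, slope_north])

slopes = feel_slopes(height, 0.0, 0.0)
compass = slopes / np.linalg.norm(slopes)  # unit arrow pointing steepest uphill

print("Slopes (east, north):", np.round(slopes, 3))
print("Compass direction:", np.round(compass, 3))
```

Standing at (0, 0), the East slope (6) is steeper than the North slope (4), so the compass arrow leans more East than North.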

What Is a Gradient?

Intuitive Definition

Gradient = The Direction of Steepest Ascent. It answers: “Which combination of changes ( $x$ and $y$ ) will increase my output the fastest?”

Mathematical Definition

The gradient (symbol $\nabla$, pronounced “del” or “nabla”) is just a vector holding all the partial derivatives:

\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix}

Top number: Slope in $x$ direction (Price) — “if I change only price, how does profit change?”
Bottom number: Slope in $y$ direction (Ad Spend) — “if I change only ads, how does profit change?”

The Key Insight: One Variable at a Time

A partial derivative is just a regular derivative, but you pretend all other variables are constants.

Derivative Type	What It Means	Example
$\frac{df}{dx}$	Regular derivative (1 variable)	$f(x) = x^2$ → $\frac{df}{dx} = 2x$
$\frac{\partial f}{\partial x}$	Partial derivative (treat $y$ as constant)	$f(x,y) = x^2 + 3y$ → $\frac{\partial f}{\partial x} = 2x$

Let’s Solve Your CEO Problem

Suppose your Profit function is:

P(x, y) = -x^2 - y^2 + 10x + 8y

Let’s find the gradient at your current position: Price = 2, Ad Spend = 3.

import numpy as np
import matplotlib.pyplot as plt

def profit(x, y):
    """The profit function - a "mountain" with a peak."""
    return -x**2 - y**2 + 10*x + 8*y

def profit_gradient(x, y):
    """
    Compute the gradient of the profit function.
    
    Step 1: Find ∂P/∂x (treat y as constant)
    P = -x² - y² + 10x + 8y
    ∂P/∂x = -2x + 0 + 10 + 0 = -2x + 10
    
    Step 2: Find ∂P/∂y (treat x as constant)
    ∂P/∂y = 0 - 2y + 0 + 8 = -2y + 8
    """
    dp_dx = -2*x + 10  # How profit changes with price
    dp_dy = -2*y + 8   # How profit changes with ads
    
    return np.array([dp_dx, dp_dy])

# Your current strategy
current_price = 2
current_ad_spend = 3

grad = profit_gradient(current_price, current_ad_spend)
current_profit = profit(current_price, current_ad_spend)

print(f"=== CEO DASHBOARD ===")
print(f"Current Position: Price=${current_price}, Ads=${current_ad_spend}")
print(f"Current Profit: ${current_profit}")
print(f"")
print(f"Gradient Vector: {grad}")

Output:

=== CEO DASHBOARD ===
Current Position: Price=$2, Ads=$3
Current Profit: $31

Gradient Vector: [6 2]

What this tells you:

6 (x-component): Increasing Price is VERY profitable right now.
2 (y-component): Increasing Ad Spend is MILDLY profitable.
Decision: You should increase BOTH, but focus 3x more effort on raising Price!

Key Insight: The gradient doesn’t just tell you “up” - it tells you the exact mix of changes to make.
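To see where repeatedly following that mix leads, here is a minimal sketch that keeps stepping along the gradient from Price=2, Ads=3. The `step_size` of 0.1 and the 50 iterations are assumed values chosen for illustration, not part of the original problem.

```python
import numpy as np

def profit(x, y):
    return -x**2 - y**2 + 10*x + 8*y

def profit_gradient(x, y):
    return np.array([-2*x + 10, -2*y + 8])

position = np.array([2.0, 3.0])   # start: Price=2, Ads=3
step_size = 0.1                   # assumed value, chosen for illustration

for step in range(50):
    # Move a little in the direction of steepest ascent
    position = position + step_size * profit_gradient(*position)

print("Final position:", np.round(position, 3))
print("Final profit:", round(profit(*position), 3))
```

The walk settles at Price=5, Ads=4, which is exactly where both partial derivatives vanish: $-2x + 10 = 0$ and $-2y + 8 = 0$.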

Partial Derivatives: Step-by-Step Guide

Before diving into examples, let’s master the technique of computing partial derivatives.

The Key Rule

To find $\frac{\partial f}{\partial x}$ : Treat ALL other variables as constants, then differentiate with respect to $x$.

Example 1: Basic Polynomial

f(x, y) = 3x^2 + 2xy + y^3

Find $\frac{\partial f}{\partial x}$ (treat $y$ as constant):

\frac{\partial f}{\partial x} = 6x + 2y + 0 = 6x + 2y

Find $\frac{\partial f}{\partial y}$ (treat $x$ as constant):

\frac{\partial f}{\partial y} = 0 + 2x + 3y^2 = 2x + 3y^2

The Gradient:

\nabla f = \begin{bmatrix}6x + 2y \\ 2x + 3y^2\end{bmatrix}

def f(x, y):
    return 3*x**2 + 2*x*y + y**3

def gradient_f(x, y):
    df_dx = 6*x + 2*y
    df_dy = 2*x + 3*y**2
    return np.array([df_dx, df_dy])

# At point (1, 2)
print(gradient_f(1, 2))  # [10, 14]

Example 2: Mixed Terms

g(x, y) = x^2 y^3 + e^{xy}

Find $\frac{\partial g}{\partial x}$ :

For $x^2 y^3$ : treat $y^3$ as constant → $2xy^3$
For $e^{xy}$ : chain rule with $u = xy$ → $e^{xy} \cdot y$

\frac{\partial g}{\partial x} = 2xy^3 + ye^{xy}

Find $\frac{\partial g}{\partial y}$ :

For $x^2 y^3$ : treat $x^2$ as constant → $3x^2y^2$
For $e^{xy}$ : chain rule with $u = xy$ → $e^{xy} \cdot x$

\frac{\partial g}{\partial y} = 3x^2y^2 + xe^{xy}
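Hand-derived gradients like these are easy to get wrong, so a common sanity check is to compare them with a finite-difference estimate. The sketch below does that for $g$; the helper `numerical_grad`, the test point, and the step `h` are illustrative choices, not part of the original example.

```python
import numpy as np

def g(x, y):
    return x**2 * y**3 + np.exp(x * y)

def grad_g(x, y):
    """Analytic gradient from the derivation above."""
    return np.array([2*x*y**3 + y*np.exp(x*y),
                     3*x**2*y**2 + x*np.exp(x*y)])

def numerical_grad(f, x, y, h=1e-6):
    """Central-difference estimate of the gradient."""
    return np.array([(f(x + h, y) - f(x - h, y)) / (2*h),
                     (f(x, y + h) - f(x, y - h)) / (2*h)])

point = (1.0, 0.5)  # arbitrary test point
analytic = grad_g(*point)
numeric = numerical_grad(g, *point)

print("Analytic:", analytic)
print("Numeric: ", numeric)
print("Match:", np.allclose(analytic, numeric, atol=1e-5))
```

If the two disagree beyond rounding error, the usual culprit is a forgotten chain-rule factor.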

Example 3: Common ML Functions

Mean Squared Error:

L(w, b) = \frac{1}{n}\sum_{i=1}^{n}(wx_i + b - y_i)^2

Find $\frac{\partial L}{\partial w}$ :

\frac{\partial L}{\partial w} = \frac{2}{n}\sum_{i=1}^{n}(wx_i + b - y_i) \cdot x_i

Find $\frac{\partial L}{\partial b}$ :

\frac{\partial L}{\partial b} = \frac{2}{n}\sum_{i=1}^{n}(wx_i + b - y_i)

def mse_gradients(w, b, x, y):
    """Compute gradients of MSE loss"""
    n = len(x)
    predictions = w * x + b
    errors = predictions - y
    
    dL_dw = (2/n) * np.sum(errors * x)
    dL_db = (2/n) * np.sum(errors)
    
    return np.array([dL_dw, dL_db])

# Example: fitting a line
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
w, b = 0.5, 1.0

grad = mse_gradients(w, b, x, y)
print(f"∂L/∂w = {grad[0]:.3f}")
print(f"∂L/∂b = {grad[1]:.3f}")
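These two partial derivatives are all that is needed to fit the line. A minimal sketch, assuming a learning rate of 0.05 and 5000 steps (both picked by hand for this tiny dataset), with `np.polyfit` used only as a reference answer:

```python
import numpy as np

def mse_gradients(w, b, x, y):
    """Gradient of the MSE loss with respect to (w, b)."""
    n = len(x)
    errors = (w * x + b) - y
    return np.array([(2/n) * np.sum(errors * x), (2/n) * np.sum(errors)])

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

params = np.array([0.5, 1.0])   # starting (w, b), same as the example above
learning_rate = 0.05            # assumed; much larger values diverge on this data

for _ in range(5000):
    # Step opposite to the gradient to reduce the loss
    params = params - learning_rate * mse_gradients(params[0], params[1], x, y)

w_fit, b_fit = params
w_ref, b_ref = np.polyfit(x, y, 1)  # closed-form least-squares line for comparison

print(f"Gradient descent: w={w_fit:.3f}, b={b_fit:.3f}")
print(f"np.polyfit:       w={w_ref:.3f}, b={b_ref:.3f}")
```

Both routes land on the same line, $w = 0.6$ and $b = 2.2$; gradient descent just gets there by many small corrections instead of a formula.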

Partial Derivative Rules Summary

Function Type	$\frac{\partial}{\partial x}$	Example
$x^n$	$nx^{n-1}$	$\frac{\partial}{\partial x}x^3 = 3x^2$
$y$ (constant)	$0$	$\frac{\partial}{\partial x}y^2 = 0$
$xy$	$y$	$\frac{\partial}{\partial x}xy = y$
$x^m y^n$	$mx^{m-1}y^n$	$\frac{\partial}{\partial x}x^2y^3 = 2xy^3$
$e^{xy}$	$ye^{xy}$	Chain rule: $(e^u)' \cdot \frac{\partial u}{\partial x}$
$\ln(xy)$	$\frac{1}{x}$	$\frac{1}{xy} \cdot y = \frac{1}{x}$
$\sin(xy)$	$y\cos(xy)$	Chain rule

Example 1: Optimizing Your Business

The Problem

Let’s formalize your CEO problem. You want to maximize Revenue based on two investments:

$x$ = Advertising Budget ($1000s)
$y$ = Product Quality Investment ($1000s)

Your Revenue Function:

R(x, y) = 100x + 80y - x^2 - y^2 - 0.5xy

Your Goal: Find the perfect budget allocation $(x, y)$ that maximizes $R$.

Visualizing Your Landscape

Here is what your revenue landscape looks like. The gradient (red arrow) shows you the fastest way to the top.

Step 1: Compute the Gradient

You need to find the partial derivatives (the slope in each direction):

Slope w.r.t Ad Budget ( $x$ ): $\frac{\partial R}{\partial x} = 100 - 2x - 0.5y$ (Treat $y$ as constant number)
Slope w.r.t Quality ( $y$ ): $\frac{\partial R}{\partial y} = 80 - 2y - 0.5x$ (Treat $x$ as constant number)

Your Gradient Vector:

\nabla R = \begin{bmatrix} 100 - 2x - 0.5y \\ 80 - 2y - 0.5x \end{bmatrix}

Step 2: Check Your Current Strategy

Suppose you are currently spending:

$x = 20$ ($20k on Ads)
$y = 15$ ($15k on Quality)

Let’s plug these into your gradient:

import numpy as np

def revenue_gradient(x, y):
    dR_dx = 100 - 2*x - 0.5*y
    dR_dy = 80 - 2*y - 0.5*x
    return np.array([dR_dx, dR_dy])

# Your current allocation
x, y = 20, 15
grad = revenue_gradient(x, y)

print(f"Current allocation: Ad=${x}k, Quality=${y}k")
print(f"Gradient: {grad}")

Output:

Current allocation: Ad=$20k, Quality=$15k
Gradient: [52.5, 40.0]

Interpretation:

52.5: Increasing Ad spend is HIGHLY profitable.
40.0: Increasing Quality is ALSO profitable, but slightly less so.
Action: Increase both, but prioritize Ads slightly more.

Step 3: Find the Optimal Allocation

To find the absolute peak, you want the point where the slope is ZERO in all directions (flat top). Set Gradient to 0:

\begin{cases} 100 - 2x - 0.5y = 0 \\ 80 - 2y - 0.5x = 0 \end{cases}

Solving this system (using linear algebra or substitution):

# Solve system of equations
# 2x + 0.5y = 100
# 0.5x + 2y = 80

A = np.array([[2, 0.5], [0.5, 2]])
b = np.array([100, 80])
optimal = np.linalg.solve(A, b)

x_opt, y_opt = optimal
print(f"Optimal Ad Budget: ${x_opt:.2f}k")
print(f"Optimal Quality Budget: ${y_opt:.2f}k")

Output:

Optimal Ad Budget: $42.67k
Optimal Quality Budget: $29.33k

Result: You found the perfect strategy! Spend about $42.67k on ads and $29.33k on quality to maximize revenue.

Real Application: Ad platforms such as Google’s balance multiple metrics (CTR, bid price, user relevance) simultaneously, which is the same kind of multi-variable optimization problem.
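A zero gradient only says the ground is flat, so it is worth checking that the point really is a peak. One rough check, sketched below, is to confirm the gradient vanishes there and that nudging the budget by $1k in a few directions always lowers revenue. The list of nudges is an arbitrary choice for illustration.

```python
import numpy as np

def revenue(x, y):
    return 100*x + 80*y - x**2 - y**2 - 0.5*x*y

def revenue_gradient(x, y):
    return np.array([100 - 2*x - 0.5*y, 80 - 2*y - 0.5*x])

# Same linear system as in Step 3
x_opt, y_opt = np.linalg.solve(np.array([[2, 0.5], [0.5, 2]]),
                               np.array([100, 80]))

grad_at_opt = revenue_gradient(x_opt, y_opt)
print("Gradient at optimum:", np.round(grad_at_opt, 6))

# Nudge the budget by $1k in a few directions; revenue should drop each time.
best = revenue(x_opt, y_opt)
nudges = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, 1)]
drops = [best - revenue(x_opt + dx, y_opt + dy) for dx, dy in nudges]
for (dx, dy), drop in zip(nudges, drops):
    print(f"Shift ({dx:+d}, {dy:+d}): revenue falls by {drop:.2f}")
```

Every nudge costs revenue, which is what you expect at the top of a hill.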

Example 2: Optimizing Your Grades

The Problem

You want to maximize your overall GPA across 3 subjects:

$x$ = hours/week on Math
$y$ = hours/week on English
$z$ = hours/week on Science

Your GPA Function:

G(x, y, z) = \sqrt{x} + \sqrt{y} + \sqrt{z} - 0.01(x^2 + y^2 + z^2)

(Square roots represent learning; squared terms represent burnout/fatigue)

Constraint: You only have 30 hours/week total.

Computing Your Gradient

The gradient tells you: “If I add 1 hour of study, which subject gives the biggest GPA boost?”

def gpa_gradient(x, y, z):
    # Partial derivatives (marginal benefit - marginal cost)
    dG_dx = 0.5/np.sqrt(x) - 0.02*x
    dG_dy = 0.5/np.sqrt(y) - 0.02*y
    dG_dz = 0.5/np.sqrt(z) - 0.02*z
    return np.array([dG_dx, dG_dy, dG_dz])

# Your current schedule
x, y, z = 10, 12, 8  # hours per subject

grad = gpa_gradient(x, y, z)

print(f"Current Schedule: Math={x}h, English={y}h, Science={z}h")
print(f"Gradient: {grad}")

Output:

Current Schedule: Math=10h, English=12h, Science=8h
Gradient: [-0.042, -0.096, 0.017]

Interpretation:

Math (-0.042): Negative! Studying MORE math will actually LOWER your GPA (burnout).
English (-0.096): Very Negative! You are over-studying English.
Science (+0.017): Positive! You should shift time to Science.

Action: Study less English/Math, study more Science!
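One way to act on this while respecting the 30-hour limit is to move hours between subjects without changing the total. The sketch below does this by subtracting the average gradient before each step (a simple form of projected gradient ascent, which is a technique brought in here for illustration). The `step_size` and iteration count are assumed values.

```python
import numpy as np

def gpa_gradient(hours):
    """Marginal GPA gain per extra hour, for each subject."""
    return 0.5 / np.sqrt(hours) - 0.02 * hours

hours = np.array([10.0, 12.0, 8.0])  # Math, English, Science (30 h total)
step_size = 5.0                      # assumed value for illustration

for _ in range(500):
    grad = gpa_gradient(hours)
    shift = grad - grad.mean()       # remove the part that would change total hours
    hours = hours + step_size * shift

print("Schedule:", np.round(hours, 2), "Total:", round(hours.sum(), 2))
```

Because the three subjects enter the GPA function identically, the schedule settles at 10 hours each: time flows away from the over-studied subjects until every marginal benefit is equal.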

Example 3: Tuning Your Recommendation System

The Problem

You are building a Netflix-style recommender. You have 3 knobs to tune:

$\alpha$ = Recency weight (how much recent views matter)
$\beta$ = Popularity weight (how much overall hits matter)
$\gamma$ = Personalization weight (how much user history matters)

Your Error Function (Lower is better):

E(\alpha, \beta, \gamma) = (\alpha - 0.6)^2 + (\beta - 0.3)^2 + (\gamma - 0.8)^2 + 0.1\alpha\beta

Goal: Find the knob settings that minimize error.

Gradient Descent Optimization

Since we want to MINIMIZE error, we move opposite to the gradient.

def error_gradient(alpha, beta, gamma):
    dE_dalpha = 2*(alpha - 0.6) + 0.1*beta
    dE_dbeta = 2*(beta - 0.3) + 0.1*alpha
    dE_dgamma = 2*(gamma - 0.8)
    return np.array([dE_dalpha, dE_dbeta, dE_dgamma])

# Start with random settings
params = np.array([0.2, 0.5, 0.4])
learning_rate = 0.1

print("Optimizing your system...")
for step in range(15):
    grad = error_gradient(*params)
    params = params - learning_rate * grad  # Move OPPOSITE to gradient
    
    if step % 5 == 0:
        print(f"Step {step}: α={params[0]:.2f}, β={params[1]:.2f}, γ={params[2]:.2f}")

print(f"Optimal Settings: α={params[0]:.2f}, β={params[1]:.2f}, γ={params[2]:.2f}")

Real Application: Real recommendation systems optimize thousands of such parameters automatically using this exact method!
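Because this error function is quadratic, its minimizer can also be found exactly by setting each partial derivative to zero and solving the resulting linear system. The sketch below compares that exact answer with a longer gradient descent run; the 200-step count is an assumed value.

```python
import numpy as np

# Setting each partial derivative to zero gives a linear system:
#   2α + 0.1β      = 1.2
#   0.1α + 2β      = 0.6
#              2γ  = 1.6
A = np.array([[2.0, 0.1, 0.0],
              [0.1, 2.0, 0.0],
              [0.0, 0.0, 2.0]])
rhs = np.array([1.2, 0.6, 1.6])
exact = np.linalg.solve(A, rhs)

def error_gradient(p):
    alpha, beta, gamma = p
    return np.array([2*(alpha - 0.6) + 0.1*beta,
                     2*(beta - 0.3) + 0.1*alpha,
                     2*(gamma - 0.8)])

params = np.array([0.2, 0.5, 0.4])
for _ in range(200):
    params = params - 0.1 * error_gradient(params)  # step opposite to the gradient

print("Exact minimizer:  ", np.round(exact, 4))
print("Gradient descent: ", np.round(params, 4))
```

The interaction term $0.1\alpha\beta$ pulls the best $\alpha$ and $\beta$ slightly below their individual targets of 0.6 and 0.3.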

Directional Derivatives: Choosing Your Path

The Question

The gradient tells you the steepest way up. But what if you can’t go that way? What if you want to go Northeast? The directional derivative answers: “How fast will I climb if I walk in THIS specific direction?”

The Formula

To find the rate of change in the direction of a unit vector $\mathbf{v}$ (length 1):

\text{Rate} = \nabla f \cdot \mathbf{v} \quad (\text{Dot Product})

If direction is same as gradient → Max rate (Steepest ascent)
If direction is perpendicular → Zero rate (Walking flat)
If direction is opposite → Negative rate (Steepest descent)

# Gradient at your position
grad = np.array([39, 40])

# You want to move Northeast (45 degrees)
direction = np.array([1, 1]) 
direction = direction / np.linalg.norm(direction) # Normalize length to 1

# How fast will you climb?
rate = np.dot(grad, direction)
print(f"Climbing rate in Northeast direction: {rate:.2f}")

Key Insight: The dot product measures “alignment”. The more your direction aligns with the gradient, the faster you climb!
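The three cases listed above can be checked numerically. This sketch reuses the gradient `[39, 40]` from the example; the helper `directional_rate` and the angle convention (0° = East, 90° = North) are illustrative choices.

```python
import numpy as np

grad = np.array([39.0, 40.0])

def directional_rate(grad, angle_deg):
    """Rate of climb when walking at angle_deg (0 = East, 90 = North)."""
    theta = np.radians(angle_deg)
    unit = np.array([np.cos(theta), np.sin(theta)])  # unit-length direction
    return np.dot(grad, unit)

grad_angle = np.degrees(np.arctan2(grad[1], grad[0]))  # heading of the gradient

for label, angle in [("Along gradient", grad_angle),
                     ("Perpendicular", grad_angle + 90),
                     ("Opposite", grad_angle + 180)]:
    print(f"{label:15s}: {directional_rate(grad, angle):8.2f}")

print(f"Gradient length: {np.linalg.norm(grad):.2f}")
```

Walking along the gradient gives a rate equal to the gradient's length, which is the largest rate any unit direction can achieve.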

Hessian Matrix (Second Derivatives)

What Is It?

Matrix of all second partial derivatives:

H = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix}

Why It Matters

At a critical point (where $\nabla f = 0$), the Hessian tells you about curvature:

Positive definite → Local minimum
Negative definite → Local maximum
Indefinite → Saddle point

# Example: f(x,y) = x² + y²
def hessian_simple():
    return np.array([
        [2, 0],
        [0, 2]
    ])

H = hessian_simple()
eigenvalues = np.linalg.eigvals(H)
print(f"Eigenvalues: {eigenvalues}")
# Both positive → minimum!
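The same eigenvalue test covers all three cases. Here is a small sketch; the function name `classify_critical_point` and the tolerance are made up for this example, and the Hessian is assumed to be symmetric and evaluated at a point where the gradient is zero.

```python
import numpy as np

def classify_critical_point(H, tol=1e-12):
    """Classify a critical point from the eigenvalues of a symmetric Hessian."""
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > tol):
        return "local minimum"
    if np.all(eig < -tol):
        return "local maximum"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"
    return "inconclusive (some eigenvalue is zero)"

examples = {
    "x² + y²":  np.array([[2.0, 0.0], [0.0, 2.0]]),
    "-x² - y²": np.array([[-2.0, 0.0], [0.0, -2.0]]),
    "x² - y²":  np.array([[2.0, 0.0], [0.0, -2.0]]),
}
for name, H in examples.items():
    print(f"{name:9s} → {classify_critical_point(H)}")
```

When an eigenvalue is exactly zero the second-derivative test cannot decide, which is why the function has a fourth, inconclusive branch.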

Practice Exercises

Exercise 1: Profit Optimization

# Profit function with 2 variables
def profit(x, y):
    return 50*x + 40*y - x**2 - y**2 - 0.5*x*y

# TODO:
# 1. Compute the gradient
# 2. Find the point where gradient = 0
# 3. Verify it's a maximum using the Hessian

🎯 Practice Exercises & Real-World Applications

Challenge yourself! These exercises show how gradients guide decisions in business, ML, and everyday life.

Exercise 1: Marketing Budget Allocation 📊

A company has a marketing budget to split between Google Ads and Instagram:

import numpy as np

# Conversion rate depends on both channels (they interact!)
# Conversions(g, i) = 100*sqrt(g) + 80*sqrt(i) + 10*sqrt(g*i)
# where g = Google spend ($000s), i = Instagram spend ($000s)
#
# Total budget: $50,000 (g + i = 50)
# Revenue per conversion: $50

# TODO:
# 1. Write the profit function (revenue - costs)
# 2. Compute the gradient ∇Profit
# 3. Find the optimal allocation
# 4. What's the gradient at g=25, i=25? What does it tell you?

💡 Solution

import numpy as np

def conversions(g, i):
    """Total conversions from both channels"""
    return 100*np.sqrt(g) + 80*np.sqrt(i) + 10*np.sqrt(g*i)

def profit(g, i, revenue_per_conv=50):
    """Profit = Revenue - Costs"""
    return revenue_per_conv * conversions(g, i) - (g + i) * 1000

def gradient(g, i, revenue_per_conv=50):
    """
    ∂P/∂g = 50 * (50/√g + 5*√(i/g)) - 1000
    ∂P/∂i = 50 * (40/√i + 5*√(g/i)) - 1000
    """
    dP_dg = revenue_per_conv * (50/np.sqrt(g) + 5*np.sqrt(i/g)) - 1000
    dP_di = revenue_per_conv * (40/np.sqrt(i) + 5*np.sqrt(g/i)) - 1000
    return np.array([dP_dg, dP_di])

print("📊 Marketing Budget Optimization")
print("=" * 55)

# Current equal split
g_curr, i_curr = 25, 25
grad = gradient(g_curr, i_curr)
print(f"\n📍 Current Split: Google=${g_curr}k, Instagram=${i_curr}k")
print(f"   Conversions: {conversions(g_curr, i_curr):.0f}")
print(f"   Profit: ${profit(g_curr, i_curr):,.0f}")
print(f"   Gradient: [{grad[0]:.2f}, {grad[1]:.2f}]")
print(f"\n   💡 Interpretation:")
print(f"   • Google marginal value: ${grad[0]:.0f} per $1k extra")
print(f"   • Instagram marginal value: ${grad[1]:.0f} per $1k extra")
if grad[0] > grad[1]:
    print(f"   → Shift budget TO Google!")
else:
    print(f"   → Shift budget TO Instagram!")

# Gradient ascent along the constraint g + i = 50
def optimize_constrained():
    g = 25.0
    lr = 0.5
    for _ in range(2000):  # 100 steps of this size stop short of the optimum
        grad = gradient(g, 50-g)
        # Move budget based on difference in marginal values
        g = g + lr * (grad[0] - grad[1]) / 2000
        g = np.clip(g, 1, 49)  # Keep valid
    return g, 50-g

g_opt, i_opt = optimize_constrained()
print(f"\n🎯 Optimal Allocation:")
print(f"   Google: ${g_opt:.1f}k ({g_opt/50*100:.0f}%)")
print(f"   Instagram: ${i_opt:.1f}k ({i_opt/50*100:.0f}%)")
print(f"   Profit: ${profit(g_opt, i_opt):,.0f}")
print(f"\n📈 Improvement: +${profit(g_opt, i_opt) - profit(25, 25):,.0f} vs equal split")

Real-World Insight: Performance marketing teams reason the same way when they allocate ad spend. The gradient tells you where your next dollar is most valuable!

Exercise 2: Neural Network Weight Update 🧠

Manually compute a gradient update for a tiny neural network:

import numpy as np

# Simple network: 2 inputs → 2 weights → 1 output
# y = w1*x1 + w2*x2
# Loss = (y - target)²

# Data point: x = [3, 4], target = 10
# Current weights: w = [1, 1]
# Predicted: y = 1*3 + 1*4 = 7
# Loss = (7 - 10)² = 9

# TODO:
# 1. Compute ∂Loss/∂w1 and ∂Loss/∂w2
# 2. Update weights with learning rate 0.01 (0.1 overshoots on this data)
# 3. Compute new prediction and loss
# 4. Repeat for 5 steps and watch loss decrease

💡 Solution

import numpy as np

def predict(w, x):
    return w[0]*x[0] + w[1]*x[1]

def loss(y_pred, y_true):
    return (y_pred - y_true) ** 2

def gradient(w, x, y_true):
    """
    L = (w1*x1 + w2*x2 - target)²
    ∂L/∂w1 = 2(y_pred - target) * x1
    ∂L/∂w2 = 2(y_pred - target) * x2
    """
    y_pred = predict(w, x)
    error = y_pred - y_true
    return np.array([2 * error * x[0], 2 * error * x[1]])

print("🧠 Neural Network Gradient Descent")
print("=" * 55)

# Setup
x = np.array([3, 4])
target = 10
w = np.array([1.0, 1.0])
lr = 0.01  # small enough to converge here; 0.1 multiplies the error by -4 each step

print(f"Data: x = {x}, target = {target}")
print(f"Initial weights: w = {w}")
print(f"Learning rate: {lr}")
print("\n" + "-" * 55)
print(f"{'Step':<6} {'Weights':<20} {'Pred':<8} {'Loss':<10} {'Gradient'}")
print("-" * 55)

for step in range(6):
    y_pred = predict(w, x)
    L = loss(y_pred, target)
    grad = gradient(w, x, target)
    
    print(f"{step:<6} [{w[0]:.3f}, {w[1]:.3f}]     {y_pred:<8.2f} {L:<10.4f} [{grad[0]:.2f}, {grad[1]:.2f}]")
    
    # Update weights
    w = w - lr * grad

print("-" * 55)
print(f"\n✅ Final weights: w = [{w[0]:.3f}, {w[1]:.3f}]")
print(f"   Prediction: {predict(w, x):.4f} (target was {target})")
print(f"   Loss reduced from 9.0 to {loss(predict(w, x), target):.6f}")

# Verify: many weight vectors fit this single data point exactly;
# [1, 1.75] is one of them (1*3 + 1.75*4 = 10)
print(f"\n💡 One perfect solution: [1.0, 1.75] → {1*3 + 1.75*4}")

Real-World Insight: This is the fundamental update rule behind neural network training! Frameworks such as PyTorch, TensorFlow, and JAX automate the same idea, just with millions of weights and more sophisticated optimizers.
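The step size matters a great deal in this exercise. With a single data point, each update multiplies the prediction error by $(1 - 2 \cdot lr \cdot \|x\|^2)$, and here $\|x\|^2 = 3^2 + 4^2 = 25$, so the loop only converges when the learning rate is below 0.04. The sketch below compares three learning rates; the helper `final_loss` is a name invented for this illustration.

```python
import numpy as np

x = np.array([3.0, 4.0])
target = 10.0

def final_loss(lr, steps=5):
    """Run a few gradient steps from w = [1, 1] and return the resulting loss."""
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        error = w @ x - target
        w = w - lr * 2 * error * x   # same update rule as the solution above
    return (w @ x - target) ** 2

for lr in [0.01, 0.03, 0.1]:
    shrink = 1 - 2 * lr * (x @ x)    # factor applied to the error each step
    print(f"lr={lr}: error factor per step = {shrink:+.1f}, "
          f"loss after 5 steps = {final_loss(lr):.4g}")
```

A factor between -1 and 1 shrinks the error; a factor of -4 makes it grow sixteen-fold in squared terms on every step.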

Exercise 3: Heat Map Navigation 🗺️

You’re a robot navigating a temperature field. Find the hottest spot:

import numpy as np

# Temperature field (2D Gaussian peaks)
# T(x, y) = 80*exp(-((x-3)² + (y-2)²)/10) + 60*exp(-((x+2)² + (y+1)²)/5)
# Two heat sources: one at (3, 2), another at (-2, -1)

# You start at position (0, 0)
# Use gradient ascent to find the hottest spot

# TODO:
# 1. Compute the gradient of T
# 2. Implement gradient ascent
# 3. Which heat source do you reach?
# 4. Try different starting positions - do you reach different peaks?

💡 Solution

import numpy as np

def temperature(x, y):
    """Two Gaussian heat sources"""
    peak1 = 80 * np.exp(-((x-3)**2 + (y-2)**2) / 10)  # Peak at (3, 2), max=80
    peak2 = 60 * np.exp(-((x+2)**2 + (y+1)**2) / 5)   # Peak at (-2, -1), max=60
    return peak1 + peak2

def gradient_T(x, y):
    """Gradient of temperature field"""
    # For peak1: 80*exp(-((x-3)² + (y-2)²)/10)
    # ∂/∂x = 80 * exp(...) * (-2(x-3)/10) = peak1 * (-(x-3)/5)
    peak1 = 80 * np.exp(-((x-3)**2 + (y-2)**2) / 10)
    peak2 = 60 * np.exp(-((x+2)**2 + (y+1)**2) / 5)
    
    dT_dx = peak1 * (-(x-3) / 5) + peak2 * (-(x+2) / 2.5)
    dT_dy = peak1 * (-(y-2) / 5) + peak2 * (-(y+1) / 2.5)
    
    return np.array([dT_dx, dT_dy])

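def gradient_ascent(start_x, start_y, lr=0.02, steps=200):
    """Climb the temperature gradient.

    The step size is kept small on purpose: near the sharper peak the
    curvature is about 24, so lr=0.5 would overshoot and bounce around.
    """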
    x, y = start_x, start_y
    path = [(x, y, temperature(x, y))]
    
    for _ in range(steps):
        grad = gradient_T(x, y)
        x = x + lr * grad[0]
        y = y + lr * grad[1]
        path.append((x, y, temperature(x, y)))
        
        if np.linalg.norm(grad) < 0.01:
            break
    
    return x, y, path

print("🗺️ Heat Map Navigation (Gradient Ascent)")
print("=" * 55)
print("Heat sources: Peak1 at (3, 2) = 80°C, Peak2 at (-2, -1) = 60°C")

# Test different starting positions
starts = [(0, 0), (5, 0), (-3, 0), (0, 3), (0, -3)]

print("\n📍 Starting Position → Final Position → Peak Reached")
print("-" * 55)

for sx, sy in starts:
    fx, fy, path = gradient_ascent(sx, sy)
    final_temp = temperature(fx, fy)
    
    # Determine which peak
    if fx > 0:
        peak = "Peak1 (80°C)"
    else:
        peak = "Peak2 (60°C)"
    
    print(f"   ({sx:3}, {sy:3}) → ({fx:.1f}, {fy:.1f}) → {peak} at {final_temp:.1f}°C")

# Detailed path from origin
print("\n🚶 Detailed Path from (0, 0):")
_, _, path = gradient_ascent(0, 0)
print("   Step | Position    | Temperature | Gradient")
print("   -----|-------------|-------------|----------")
for i in [0, 5, 10, 20, len(path)-1]:
    if i < len(path):
        x, y, t = path[i]
        g = gradient_T(x, y)
        print(f"   {i:4} | ({x:4.1f}, {y:4.1f}) | {t:11.2f} | ({g[0]:5.2f}, {g[1]:5.2f})")

print("\n💡 Key Insight:")
print("   Gradient ascent finds LOCAL maxima - you reach")
print("   whichever peak you're closest to initially!")
print("   This is why neural networks can get stuck in local minima!")

Real-World Insight: This local vs global optimum problem is fundamental in ML. It’s why we use random initialization, momentum, and techniques like simulated annealing to escape local optima!

Exercise 4: Portfolio Optimization 💼

Find the optimal stock allocation to maximize risk-adjusted return:

import numpy as np

# Two stocks: A (high risk/return) and B (low risk/return)
# Expected return: R(a, b) = 0.15*a + 0.08*b (a, b are allocation fractions)
# Variance (risk): V(a, b) = 0.04*a² + 0.01*b² + 0.01*a*b
# 
# Sharpe ratio (risk-adjusted return): S = R / sqrt(V)
# Constraint: a + b = 1 (fully invested)

# TODO:
# 1. Express S in terms of a only (since b = 1 - a)
# 2. Find the gradient ∂S/∂a
# 3. Find optimal allocation
# 4. Compare with 50/50 split

💡 Solution

import numpy as np

def returns(a):
    """Expected return: R = 0.15*a + 0.08*(1-a)"""
    b = 1 - a
    return 0.15 * a + 0.08 * b

def variance(a):
    """Portfolio variance"""
    b = 1 - a
    return 0.04 * a**2 + 0.01 * b**2 + 0.01 * a * b

def sharpe(a):
    """Sharpe ratio = Return / Risk"""
    return returns(a) / np.sqrt(variance(a))

def sharpe_gradient(a, eps=1e-6):
    """Numerical gradient for Sharpe ratio"""
    return (sharpe(a + eps) - sharpe(a - eps)) / (2 * eps)

print("💼 Portfolio Optimization")
print("=" * 55)
print("Stock A: 15% return, 20% volatility (high risk)")
print("Stock B: 8% return, 10% volatility (low risk)")
print("Correlation: 0.25")

# Gradient ascent to find optimal allocation
a = 0.5  # Start at 50/50
lr = 0.5
history = [(a, sharpe(a))]

for _ in range(50):
    grad = sharpe_gradient(a)
    a_new = a + lr * grad
    a = np.clip(a_new, 0, 1)  # Keep valid allocation
    history.append((a, sharpe(a)))
    if abs(grad) < 1e-6:
        break

optimal_a = a

print(f"\n🎯 Optimal Allocation:")
print(f"   Stock A (high risk): {optimal_a*100:.1f}%")
print(f"   Stock B (low risk): {(1-optimal_a)*100:.1f}%")
print(f"   Expected Return: {returns(optimal_a)*100:.2f}%")
print(f"   Portfolio Risk: {np.sqrt(variance(optimal_a))*100:.2f}%")
print(f"   Sharpe Ratio: {sharpe(optimal_a):.4f}")

# Comparison table
print("\n📊 Allocation Comparison:")
print("   Allocation | Return | Risk   | Sharpe")
print("   ----------|--------|--------|--------")
for alloc, label in [(0, "100% B"), (0.5, "50/50"), (optimal_a, "Optimal"), (1, "100% A")]:
    r = returns(alloc)
    v = np.sqrt(variance(alloc))
    s = sharpe(alloc)
    marker = " ←" if abs(alloc - optimal_a) < 0.01 else ""
    print(f"   {label:9} | {r*100:5.1f}% | {v*100:5.1f}% | {s:.4f}{marker}")

print(f"\n💡 Key Insight:")
print(f"   The gradient told us to shift from 50/50 toward lower Stock A")
print(f"   allocation, but not 0% - diversification improves the Sharpe ratio!")

Real-World Insight: This is the core idea of Modern Portfolio Theory (Markowitz, Nobel Prize 1990). Robo-advisors such as Wealthfront and Betterment build on the same framework to construct efficient portfolios!
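With only one free variable, a brute-force scan gives an independent check on where a gradient-ascent loop should end up. The sketch below evaluates the Sharpe ratio on a grid of allocations; the grid spacing of 0.1% is an arbitrary choice.

```python
import numpy as np

def sharpe(a):
    """Sharpe ratio for fraction a in Stock A and 1 - a in Stock B."""
    ret = 0.15 * a + 0.08 * (1 - a)
    var = 0.04 * a**2 + 0.01 * (1 - a)**2 + 0.01 * a * (1 - a)
    return ret / np.sqrt(var)

grid = np.linspace(0, 1, 1001)           # allocations to Stock A, 0.1% apart
best_a = grid[np.argmax(sharpe(grid))]   # grid point with the highest Sharpe ratio

print(f"Best allocation on the grid: {best_a:.3f} in Stock A")
print(f"Sharpe there: {sharpe(best_a):.4f}  (50/50 gives {sharpe(0.5):.4f})")
```

The best grid point sits near 31% in Stock A, below the 50/50 starting split.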

🎯 Practice Problems: Test Your Understanding

Before moving on, make sure you can solve these problems. They’re ordered by difficulty.

Problem 1: Basic Partial Derivatives (Easy)

Given:

f(x, y) = 3x^2 + 4xy - y^2 + 5

Find:

$\frac{\partial f}{\partial x}$
$\frac{\partial f}{\partial y}$
$\nabla f$ at point $(1, 2)$

Show Solution

Step 1: Find $\frac{\partial f}{\partial x}$ (treat $y$ as constant):

\frac{\partial f}{\partial x} = 6x + 4y + 0 + 0 = 6x + 4y

Step 2: Find $\frac{\partial f}{\partial y}$ (treat $x$ as constant):

\frac{\partial f}{\partial y} = 0 + 4x - 2y + 0 = 4x - 2y

Step 3: Evaluate at $(1, 2)$:

\nabla f(1, 2) = \begin{bmatrix}6(1) + 4(2) \\ 4(1) - 2(2)\end{bmatrix} = \begin{bmatrix}14 \\ 0\end{bmatrix}

Interpretation: At point (1, 2), the function increases fastest in the x-direction. The zero in y means changing y alone (at this point) doesn’t change f at first order.

Problem 2: Product Rule (Medium)

Given:

g(x, y) = x^2 e^y

Find: $\nabla g$

Show Solution

Find $\frac{\partial g}{\partial x}$ : Treat $e^y$ as a constant:

\frac{\partial g}{\partial x} = 2x \cdot e^y = 2xe^y

Find $\frac{\partial g}{\partial y}$ : Treat $x^2$ as a constant:

\frac{\partial g}{\partial y} = x^2 \cdot e^y = x^2e^y

The Gradient:

\nabla g = \begin{bmatrix}2xe^y \\ x^2e^y\end{bmatrix} = e^y\begin{bmatrix}2x \\ x^2\end{bmatrix}

Problem 3: Find the Optimum (Medium)

Given:

h(x, y) = -x^2 - 2y^2 + 4x + 8y - 10

Find: The point $(x^*, y^*)$ where $\nabla h = 0$ (the critical point).

Show Solution

Step 1: Compute gradient:

\frac{\partial h}{\partial x} = -2x + 4

\frac{\partial h}{\partial y} = -4y + 8

Step 2: Set each component to zero:

-2x + 4 = 0 \implies x = 2

-4y + 8 = 0 \implies y = 2

The critical point is $(2, 2)$.

Step 3: Verify it’s a maximum (Hessian check):

H = \begin{bmatrix}-2 & 0 \\ 0 & -4\end{bmatrix}

Both eigenvalues are negative, so this is indeed a maximum!

Value at maximum:

h(2, 2) = -4 - 8 + 8 + 16 - 10 = 2

Problem 4: ML Loss Function (Hard)

Given: The MSE loss for linear regression with 3 data points:

$(x_1, y_1) = (1, 3)$
$(x_2, y_2) = (2, 5)$
$(x_3, y_3) = (3, 7)$

Model:

\hat{y} = wx + b

Loss:

L(w, b) = \frac{1}{3}\sum_{i=1}^{3}(\hat{y}_i - y_i)^2

Find: $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$ at $w=1, b=1$.

Show Solution

Step 1: Compute predictions at $w=1, b=1$:

$\hat{y}_1 = 1(1) + 1 = 2$ (actual: 3, error: -1)
$\hat{y}_2 = 1(2) + 1 = 3$ (actual: 5, error: -2)
$\hat{y}_3 = 1(3) + 1 = 4$ (actual: 7, error: -3)

Step 2: Loss formula expanded:

L = \frac{1}{3}[(w \cdot 1 + b - 3)^2 + (w \cdot 2 + b - 5)^2 + (w \cdot 3 + b - 7)^2]

Step 3: Gradient formulas:

\frac{\partial L}{\partial w} = \frac{2}{3}\sum_{i=1}^{3}(wx_i + b - y_i) \cdot x_i

\frac{\partial L}{\partial b} = \frac{2}{3}\sum_{i=1}^{3}(wx_i + b - y_i)

Step 4: Evaluate:

\frac{\partial L}{\partial w} = \frac{2}{3}[(-1)(1) + (-2)(2) + (-3)(3)] = \frac{2}{3}(-1 - 4 - 9) = \frac{2}{3}(-14) = -\frac{28}{3} \approx -9.33

\frac{\partial L}{\partial b} = \frac{2}{3}[(-1) + (-2) + (-3)] = \frac{2}{3}(-6) = -4

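Interpretation: Both gradients are negative, meaning increasing $w$ and $b$ will DECREASE the loss (which is what we want!). The true optimal values are $w=2, b=1$. Note that $b$ already sits at its optimal value; its gradient is nonzero only because $w$ is still too small, so every prediction is too low.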
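The hand calculation can be double-checked in a few lines. This sketch evaluates the gradient formulas from Step 3 at the starting point and at the stated optimum; the function name `mse_gradient` is chosen just for this check.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])

def mse_gradient(w, b):
    """Gradient of the 3-point MSE loss with respect to (w, b)."""
    errors = w * x + b - y
    return np.array([2 * np.mean(errors * x), 2 * np.mean(errors)])

print("Gradient at (w=1, b=1):", mse_gradient(1.0, 1.0))
print("Gradient at (w=2, b=1):", mse_gradient(2.0, 1.0))
```

The first line should agree with $-28/3 \approx -9.33$ and $-4$, and the second should be zero in both components, confirming that $w=2, b=1$ is a critical point.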

🔑 Key Takeaways

Gradient Essentials:

✅ Gradient - Vector of all partial derivatives; ∇f = [∂f/∂x₁, ∂f/∂x₂, …]
✅ Direction - Points toward steepest ascent; negate for descent
✅ Magnitude - Tells you how steep the slope is at that point
✅ Optimization - Critical points where ∇f = 0
✅ Hessian - Second derivatives tell if it’s min (positive definite), max, or saddle

Interview Prep: Gradient Questions

Common Gradient Interview Questions

Q: What does the gradient represent geometrically?

The gradient points in the direction of steepest increase. Its magnitude indicates how steep that ascent is. For a loss function, we move in the opposite direction (−∇f) to find the minimum.

Q: Why can’t we just set the gradient to zero and solve for neural networks?

Neural networks have millions of parameters with highly non-linear, non-convex loss surfaces. There’s no closed-form solution. We must use iterative gradient descent to find good (local) minima.

Q: What’s the Hessian and when is it useful?

The Hessian is the matrix of second partial derivatives. It tells us about curvature: positive definite = minimum, negative definite = maximum, indefinite = saddle point. Second-order methods use it for faster convergence but are expensive.
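To make the "faster convergence" claim concrete, here is a minimal sketch comparing plain gradient descent with a single Newton step, which rescales the gradient by the inverse Hessian. The test function $f(x, y) = x^2 + 10y^2$, the starting point, and the learning rate are all assumptions picked for illustration.

```python
import numpy as np

# Assumed test function: f(x, y) = x² + 10y², minimum at (0, 0)
def grad(p):
    return np.array([2 * p[0], 20 * p[1]])

H = np.array([[2.0, 0.0], [0.0, 20.0]])  # constant Hessian of f

start = np.array([5.0, 5.0])

# Gradient descent: many small first-order steps
p = start.copy()
for _ in range(50):
    p = p - 0.05 * grad(p)

# Newton's method: one step, solving H @ step = gradient
newton = start - np.linalg.solve(H, grad(start))

print("After 50 gradient steps:", p)
print("After 1 Newton step:    ", newton)
```

On a quadratic, one Newton step lands exactly on the minimum, while gradient descent is still creeping along the flatter direction. The catch is the cost: for $n$ parameters the Hessian has $n^2$ entries, which is why it is rarely formed explicitly for large networks.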

Common Pitfalls

Gradient Mistakes to Avoid:

Sign Confusion - Gradient points uphill; for minimization, move in the OPPOSITE direction
Ignoring Multiple Variables - Partial derivatives hold others constant; the gradient combines them all
Assuming Global Optimum - Non-convex functions have local minima; gradient zero doesn’t mean global best
Dimension Mismatch - Gradient has same dimension as input; ∇f : ℝⁿ → ℝⁿ

What’s Next?

You now understand gradients for multi-variable functions. But how do we handle COMPOSITIONS of functions (like neural networks with many layers)? That’s the chain rule - and it’s the key to backpropagation!

Next: Chain Rule & Backpropagation

Discover how neural networks learn through backpropagation
