Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Gradient Descent Algorithm

Gradient Descent

Your Challenge: Lost in the Mountains

Imagine you are dropped onto a random spot in a vast, foggy mountain range at night.
  • You can’t see the bottom (the valley).
  • You can’t see more than 3 feet in front of you.
  • You have no map.
Your Goal: Find the absolute lowest point in the entire valley (the minimum). How do you do it? You can’t “solve” the mountain. You can’t just teleport to the bottom. You have to feel your way down.
  1. You feel the slope under your feet.
  2. You take a small step downhill.
  3. You repeat this thousands of times.
Eventually, you reach the bottom.

The Algorithm Visualized

Gradient Descent Path This is Gradient Descent. It’s the “blind hiker” algorithm.
  • The Mountain: Your Loss Function (Error).
  • Your Position: The current weights of your model.
  • The Slope: The Gradient.
  • The Step Size: The Learning Rate.
  • The Bottom: The Optimal Weights (Best Model).
🔗 ML Connection: Gradient descent is THE learning algorithm. Here’s where it runs:
SystemParametersGradient Steps
GPT-41.7 trillionBillions of updates
Stable Diffusion1 billion~1 million steps
BERT340 million~1 million steps
Your first model2 (w, b)~100-1000 steps
The math is identical—only the scale changes!

The Core Intuition

Why We Need It

In high school math, you found the minimum by setting the derivative to zero: f(x)=0f'(x) = 0 That works for simple parabolas like x2x^2. But in Deep Learning, your function looks like a crumpled piece of paper in 1,000,000 dimensions. You cannot solve f(x)=0f'(x) = 0 algebraically. It is impossible. Think of the difference like this. Finding where the derivative equals zero analytically is like solving a Sudoku puzzle by reasoning through every constraint — elegant, but only feasible for small puzzles. Gradient descent is like a GPS navigation system: you do not need a complete map of the terrain. You just need to know which direction is downhill from where you are right now, and you take one step at a time. GPS works in any city, no matter how complex the road network. Gradient descent works for any differentiable function, no matter how many variables. So instead of solving for the answer directly, we search for it iteratively.

The Code: A Simple Descent

Let’s implement the “blind hiker” logic for a simple valley: f(x)=x24x+5f(x) = x^2 - 4x + 5.
import numpy as np

# 1. The Mountain (Loss Function)
def f(x):
    return x**2 - 4*x + 5

# 2. The Slope (Gradient)
def gradient(x):
    return 2*x - 4  # Derivative of x² - 4x + 5

# 3. The Hike
x = 0  # Start at random spot
learning_rate = 0.1  # Size of each step

print(f"Start at x={x}")

for step in range(20):
    # Feel the slope
    grad = gradient(x)
    
    # Take a step downhill (opposite to gradient)
    x = x - learning_rate * grad
    
    print(f"Step {step+1}: Moved to x={x:.4f}, Height={f(x):.4f}")

print(f"\nReached bottom at x={x:.4f}")
Output:
Start at x=0
Step 1: Moved to x=0.4000, Height=3.5600
Step 2: Moved to x=0.7200, Height=2.6384
...
Step 20: Moved to x=1.9769, Height=1.0005
Reached bottom at x=1.9769
Result: You started at 0 and walked your way to ~2 (the true minimum). You solved it without algebra!
🎮 Interactive Visualization: Explore gradient descent interactively!
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

def f(x):
    return x**2 - 4*x + 5

def gradient(x):
    return 2*x - 4

# Animate gradient descent
fig, ax = plt.subplots(figsize=(10, 6))
x_range = np.linspace(-1, 5, 100)
ax.plot(x_range, f(x_range), 'b-', linewidth=2, label='f(x) = x² - 4x + 5')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.set_title('Gradient Descent Animation')
ax.legend()

# Initialize point
point, = ax.plot([], [], 'ro', markersize=15)
path_x, path_y = [], []
path_line, = ax.plot([], [], 'r--', alpha=0.5)

x_current = 0
lr = 0.1

def init():
    point.set_data([], [])
    path_line.set_data([], [])
    return point, path_line

def animate(frame):
    global x_current, path_x, path_y
    if frame == 0:
        x_current = 0
        path_x, path_y = [0], [f(0)]
    else:
        grad = gradient(x_current)
        x_current = x_current - lr * grad
        path_x.append(x_current)
        path_y.append(f(x_current))
    
    point.set_data([x_current], [f(x_current)])
    path_line.set_data(path_x, path_y)
    ax.set_title(f'Step {frame}: x = {x_current:.4f}, f(x) = {f(x_current):.4f}')
    return point, path_line

anim = FuncAnimation(fig, animate, init_func=init, frames=25, interval=300, blit=True)
HTML(anim.to_jshtml())  # Run in Jupyter!

The Algorithm

Mathematical Formulation

Update rule: xnew=xoldαf(xold)x_{new} = x_{old} - \alpha \cdot \nabla f(x_{old}) Where:
  • α\alpha = learning rate (step size)
  • f\nabla f = gradient (direction of steepest ascent)
  • We subtract because we want to go downhill

Pseudocode

1. Initialize parameters randomly
2. Repeat until convergence:
   a. Compute gradient at current point
   b. Update parameters: x = x - α × gradient
   c. Check if converged (gradient ≈ 0)
3. Return optimized parameters


Example 1: Training Your First Model

The Problem

You want to predict house prices based on square footage. Price=wsqft+b\text{Price} = w \cdot \text{sqft} + b Your Goal: Find the best ww (weight) and bb (bias) that minimize the error.

The Data

# Training data
sqft = np.array([1.0, 1.5, 2.0, 2.5, 3.0]) # in 1000s sqft
prices = np.array([200, 250, 300, 350, 400])  # in $1000s

Gradient Descent Training

def predict(x, w, b):
    return w * x + b

def compute_gradients(w, b, x, y):
    n = len(x)
    pred = predict(x, w, b)
    error = pred - y
    
    # Gradients (Partial Derivatives of MSE)
    dL_dw = (2/n) * np.sum(error * x)
    dL_db = (2/n) * np.sum(error)
    return dL_dw, dL_db

# 1. Initialize randomly
w, b = 0.0, 0.0
learning_rate = 0.01

print("Training started...")
for epoch in range(1000):
    # 2. Compute Gradient
    grad_w, grad_b = compute_gradients(w, b, sqft, prices)
    
    # 3. Update Parameters (Step Downhill)
    w = w - learning_rate * grad_w
    b = b - learning_rate * grad_b
    
    if epoch % 100 == 0:
        loss = np.mean((predict(sqft, w, b) - prices)**2)
        print(f"Epoch {epoch}: Loss={loss:.2f}, w={w:.2f}, b={b:.2f}")

print(f"\nFinal Model: Price = {w:.2f} * sqft + {b:.2f}")
print("True Answer: Price = 100 * sqft + 100")
Result: Your model learned the relationship perfectly!

Example 2: Optimizing Your Prices

The Scenario

You run an e-commerce site. You control two things:
  • x1x_1 = Ad spend ($1000s)
  • x2x_2 = Discount percentage
Your Revenue Function (Unknown to you, but we simulate it): R(x1,x2)=100x1+50x2x12x22R(x_1, x_2) = 100x_1 + 50x_2 - x_1^2 - x_2^2 Your Goal: Maximize Revenue (which means minimizing Negative Revenue).

Gradient Descent Optimization

def neg_revenue_gradient(x1, x2):
    # Gradient of -R (to minimize)
    # R = 100x + 50y - x^2 - y^2
    # dR/dx = 100 - 2x  ->  d(-R)/dx = 2x - 100
    # dR/dy = 50 - 2y   ->  d(-R)/dy = 2y - 50
    grad_x1 = 2*x1 - 100
    grad_x2 = 2*x2 - 50
    return np.array([grad_x1, grad_x2])

# Start with random strategy
strategy = np.array([0.0, 0.0]) # $0 ads, 0% discount
lr = 0.1

for step in range(50):
    grad = neg_revenue_gradient(strategy[0], strategy[1])
    strategy = strategy - lr * grad
    
    if step % 10 == 0:
        print(f"Step {step}: Ads=${strategy[0]:.2f}k, Discount={strategy[1]:.2f}%")

print(f"\nOptimal Strategy: Ads=${strategy[0]:.2f}k, Discount={strategy[1]:.2f}%")
Output:
Optimal Strategy: Ads=$50.00k, Discount=$25.00%
Insight: You found the optimal business strategy just by following the gradient!

Example 3: Training Your Neural Network

The Challenge

You want to train a simple neural network to solve a problem.
  • Input: xx
  • Target: yy
  • Model: ypred=σ(wx+b)y_{pred} = \sigma(wx + b)

The Code

def sigmoid(z): return 1 / (1 + np.exp(-z))

# Training data (XOR-like)
X = np.array([0, 1])
y = np.array([0, 1])

# Initialize
w, b = 0.5, 0.0
lr = 1.0

print("Training Neural Network...")
for epoch in range(100):
    total_grad_w = 0
    total_grad_b = 0
    
    for i in range(len(X)):
        # Forward
        z = w * X[i] + b
        pred = sigmoid(z)
        
        # Backward (Chain Rule!)
        error = pred - y[i]
        dL_dpred = 2 * error
        dpred_dz = pred * (1 - pred)
        dz_dw = X[i]
        dz_db = 1
        
        grad_w = dL_dpred * dpred_dz * dz_dw
        grad_b = dL_dpred * dpred_dz * dz_db
        
        # Accumulate gradients
        total_grad_w += grad_w
        total_grad_b += grad_b
    
    # Update weights (Gradient Descent Step)
    w = w - lr * (total_grad_w / len(X))
    b = b - lr * (total_grad_b / len(X))

print(f"Final Weights: w={w:.2f}, b={b:.2f}")
print(f"Prediction for 0: {sigmoid(w*0 + b):.2f}")
print(f"Prediction for 1: {sigmoid(w*1 + b):.2f}")
This is Deep Learning. It’s just Gradient Descent applied to a lot of weights!

Learning Rate: The “Goldilocks” Problem

Choosing the step size (α\alpha) is the most important decision you make.
Learning Rate: The Goldilocks Problem

1. Too Small (The Turtle)

  • Symptom: Loss decreases veeeery slowly.
  • Result: You run out of time/patience before reaching the bottom.

2. Too Large (The Grasshopper)

  • Symptom: Loss bounces around or even INCREASES.
  • Result: You overshoot the valley and never converge.

3. Just Right (Goldilocks)

  • Symptom: Loss decreases steadily and quickly.
  • Result: You reach the minimum efficiently.

How to Find It?

Start with 0.01 or 0.001. If loss is slow, increase it (0.1). If loss explodes, decrease it (0.0001).
The Learning Rate Finder (A Practical Technique)Rather than guessing, there is a systematic approach popularized by Leslie Smith and widely adopted in fast.ai. The idea is simple: start with a very small learning rate and exponentially increase it over one epoch, recording the loss at each step.
# Learning rate finder sketch
lr = 1e-7
lr_mult = 1.1  # Increase by 10% each step
lrs, losses = [], []

for batch in training_data:
    loss = train_one_step(model, batch, lr)
    lrs.append(lr)
    losses.append(loss)
    lr *= lr_mult
    if loss > 4 * min(losses):  # Loss is diverging
        break

# Plot lrs vs losses (log scale for lr)
# The best LR is roughly where loss is falling fastest
# (steepest negative slope on the curve)
The sweet spot is typically one order of magnitude below where the loss starts exploding. If the loss starts diverging at lr=0.1, try lr=0.01. This takes 60 seconds and saves hours of wasted training.
Numerical Pitfall: NaN and Inf During TrainingWhen your learning rate is too large, weight updates overshoot so badly that values exceed float64 range (10308\approx 10^{308}). The sequence is: large gradient times large learning rate produces a huge weight, which produces an even larger gradient next step, which produces infinity, which produces NaN when multiplied by zero in some activation.Once a single NaN appears in your weights, it propagates everywhere (NaN times anything is NaN), and recovery is impossible without reloading a checkpoint. This is why:
  1. Always log the loss every N steps and check for sudden spikes
  2. Use gradient clipping (max norm 1.0 is a safe default)
  3. Save checkpoints frequently so you can recover from divergence
  4. Use mixed precision carefully — float16 overflows at 65,504, much sooner than float64
If you see NaN loss, do not increase the number of training steps. Reduce the learning rate.

Variants of Gradient Descent

The Survey Analogy

To understand the three variants, think about conducting a political survey to decide your campaign strategy. Batch Gradient Descent is like surveying every single voter in the country before making any decision. You get a perfectly accurate picture, but it takes forever and costs a fortune. Stochastic Gradient Descent (SGD) is like asking one random person on the street and immediately adjusting your entire strategy based on their opinion. Cheap and fast, but wildly noisy — one grumpy respondent could send you in the wrong direction. Mini-Batch Gradient Descent is like polling a focus group of 32-256 people. Not perfect, but a reasonable estimate that you can act on quickly. This is what everyone uses in practice.

Batch Gradient Descent

Uses all data points to compute gradient:
# Compute gradient using ALL data
gradient = compute_gradient(all_data)
params = params - lr * gradient
Pros: Stable, smooth convergence
Cons: Slow for large datasets; must fit all data in memory

Stochastic Gradient Descent (SGD)

Uses one data point at a time:
for each data_point in dataset:
    gradient = compute_gradient(data_point)
    params = params - lr * gradient
Pros: Fast, can escape local minima due to noise
Cons: Noisy, unstable; poor GPU utilization (GPUs want parallel work)

Mini-Batch Gradient Descent

Uses small batches of data:
for batch in dataset.batches(batch_size=32):
    gradient = compute_gradient(batch)
    params = params - lr * gradient
Pros: Best of both worlds; efficient GPU utilization
Cons: Need to choose batch size (32-256 is usually a safe range)
This is what everyone uses in practice. When someone says “SGD” in a deep learning context, they almost always mean mini-batch SGD, not true single-sample SGD.
Batch Size and GeneralizationThere is a fascinating and somewhat counterintuitive finding: smaller batch sizes often lead to better generalization (test accuracy), even though larger batches give more accurate gradient estimates. The theory is that the noise from small batches acts as a form of implicit regularization, preventing the model from settling into sharp, narrow minima that do not generalize. The “large batch training problem” was a major research topic around 2017-2019, and solutions like learning rate warmup and LARS/LAMB optimizers were developed to make large batches work well.In practice, choose the largest batch size that fits in GPU memory, then scale the learning rate proportionally (linear scaling rule: double the batch, double the learning rate).

Convergence Criteria

When to Stop?

Option 1: Gradient is small
if np.linalg.norm(gradient) < 1e-6:
    break  # Converged!
Option 2: Loss stops improving
if abs(loss_new - loss_old) < 1e-6:
    break  # Converged!
Option 3: Maximum iterations
if epoch >= max_epochs:
    break  # Give up

Practice Exercises

Exercise 1: Implement Gradient Descent

# Minimize f(x) = x⁴ - 3x³ + 2
# TODO:
# 1. Compute the derivative
# 2. Implement gradient descent
# 3. Find the minimum
# 4. Try different learning rates

🎯 Practice Exercises & Real-World Applications

Challenge yourself! These exercises let you feel gradient descent working on real problems.

Exercise 1: Train a Linear Model by Hand 📈

Implement gradient descent to fit a line to data:
import numpy as np

# Dataset: House sizes and prices
sizes = np.array([1000, 1500, 2000, 2500, 3000])  # sq ft
prices = np.array([200, 280, 350, 400, 480])       # $1000s

# Model: price = w * size + b
# Loss: Mean Squared Error

# TODO:
# 1. Initialize w=0.1, b=50
# 2. Compute gradients dL/dw and dL/db
# 3. Run gradient descent for 1000 steps
# 4. Plot the learning curve (loss vs steps)
import numpy as np

# Data
sizes = np.array([1000, 1500, 2000, 2500, 3000])
prices = np.array([200, 280, 350, 400, 480])
n = len(sizes)

def predict(w, b):
    return w * sizes + b

def mse_loss(w, b):
    preds = predict(w, b)
    return np.mean((preds - prices) ** 2)

def gradients(w, b):
    preds = predict(w, b)
    errors = preds - prices
    dw = 2 * np.mean(errors * sizes)
    db = 2 * np.mean(errors)
    return dw, db

print("📈 Linear Regression with Gradient Descent")
print("=" * 55)

# Initialize
w, b = 0.1, 50
lr = 1e-7  # Small learning rate (sizes are big numbers)
history = []

print("\n🚀 Training Progress:")
print(f"{'Step':<8} {'w':<12} {'b':<12} {'Loss':<12}")
print("-" * 44)

for step in range(1001):
    loss = mse_loss(w, b)
    history.append(loss)
    
    if step % 200 == 0:
        print(f"{step:<8} {w:<12.6f} {b:<12.2f} {loss:<12.2f}")
    
    dw, db = gradients(w, b)
    w = w - lr * dw
    b = b - lr * db

print("\n✅ Final Model:")
print(f"   price = {w:.4f} × size + {b:.2f}")
print(f"   Final loss: {mse_loss(w, b):.2f}")

# Test predictions
print("\n🏠 Predictions vs Actual:")
print(f"   {'Size':<8} {'Actual':<10} {'Predicted':<10} {'Error':<10}")
print("-" * 40)
preds = predict(w, b)
for s, actual, pred in zip(sizes, prices, preds):
    print(f"   {s:<8} ${actual:>6}k    ${pred:>6.1f}k    {abs(actual-pred):>6.1f}k")

# Learning curve visualization
print("\n📉 Loss Reduction:")
for i, s in enumerate([0, 200, 500, 1000]):
    bar_len = int(50 * history[s] / history[0])
    bar = "█" * bar_len
    print(f"   Step {s:4}: {bar} {history[s]:.1f}")
Real-World Insight: This is exactly how scikit-learn’s LinearRegression.fit() works under the hood (though it uses closed-form solution when possible for speed).

Exercise 2: Learning Rate Experiments 🎛️

Explore how learning rate affects convergence:
import numpy as np

def f(x):
    """A simple valley: f(x) = x² - 4x + 5"""
    return x**2 - 4*x + 5

def gradient(x):
    return 2*x - 4

# TODO:
# 1. Try learning rates: 0.01, 0.1, 0.5, 1.0, 1.1
# 2. Run 20 steps from x=0
# 3. Observe: which converges? which diverges? which oscillates?
# 4. Find the "critical" learning rate where things break
import numpy as np

def f(x):
    return x**2 - 4*x + 5

def gradient(x):
    return 2*x - 4

def run_gd(lr, steps=20, start=0):
    x = start
    history = [x]
    for _ in range(steps):
        x = x - lr * gradient(x)
        history.append(x)
        if abs(x) > 1000:  # Diverged
            break
    return history

print("🎛️ Learning Rate Experiments")
print("=" * 55)
print("Target: x* = 2.0 (minimum of f(x) = x² - 4x + 5)")

learning_rates = [0.01, 0.1, 0.5, 0.9, 1.0, 1.1, 1.5]

print("\n📊 Results after 20 steps:")
print(f"{'LR':<8} {'Final x':<12} {'Distance to optimal':<20} {'Status'}")
print("-" * 60)

for lr in learning_rates:
    history = run_gd(lr)
    final_x = history[-1]
    distance = abs(final_x - 2.0)
    
    if abs(final_x) > 1000:
        status = "💥 DIVERGED"
    elif abs(final_x - 2.0) < 0.01:
        status = "✅ Converged"
    elif len(set([round(x, 4) for x in history[-5:]])) > 1 and distance < 1:
        status = "🔄 Oscillating"
    else:
        status = "⏳ Still converging"
    
    if abs(final_x) < 1000:
        print(f"{lr:<8} {final_x:<12.4f} {distance:<20.6f} {status}")
    else:
        print(f"{lr:<8} {'inf':>12} {'∞':>20} {status}")

# Detailed trajectory for key learning rates
print("\n📈 Detailed Trajectories (first 10 steps):")
for lr in [0.1, 0.9, 1.1]:
    history = run_gd(lr, steps=10)
    print(f"\n   LR = {lr}:")
    print(f"   " + " → ".join([f"{x:.2f}" for x in history[:8]]) + 
          (" → 💥" if abs(history[-1]) > 100 else ""))

# Critical learning rate analysis
print("\n🎯 Critical Learning Rate Analysis:")
print("   For f(x) = x², the 2nd derivative f''(x) = 2")
print("   Critical LR = 2/f''(x) = 2/2 = 1.0")
print("   - LR < 1.0: Converges")
print("   - LR = 1.0: Oscillates forever (x → 0 → 4 → 0 → 4...)")
print("   - LR > 1.0: Diverges!")
Real-World Insight: This is why learning rate is the most important hyperparameter in deep learning! Too small = slow training, too large = divergence. That’s why Adam optimizer auto-adapts the learning rate.

Exercise 3: Escaping Local Minima 🕳️

Navigate a function with multiple minima:
import numpy as np

def f(x):
    """Function with local minima: f(x) = x⁴ - 8x² + x"""
    return x**4 - 8*x**2 + x

def gradient(x):
    return 4*x**3 - 16*x + 1

# This function has:
# - Local minimum near x ≈ -2
# - Global minimum near x ≈ 2  
# - Local maximum near x ≈ 0

# TODO:
# 1. Start at x = -3 and run gradient descent
# 2. Do you reach the global minimum?
# 3. Try adding "momentum" to escape local minima
# 4. Try random restarts - which works better?
import numpy as np

def f(x):
    return x**4 - 8*x**2 + x

def gradient(x):
    return 4*x**3 - 16*x + 1

def gd_basic(start, lr=0.01, steps=200):
    x = start
    for _ in range(steps):
        x = x - lr * gradient(x)
    return x, f(x)

def gd_momentum(start, lr=0.01, momentum=0.9, steps=200):
    x = start
    velocity = 0
    for _ in range(steps):
        velocity = momentum * velocity + lr * gradient(x)
        x = x - velocity
    return x, f(x)

def random_restarts(n_restarts=10, lr=0.01, steps=200):
    best_x, best_f = None, float('inf')
    for _ in range(n_restarts):
        start = np.random.uniform(-4, 4)
        x, fx = gd_basic(start, lr, steps)
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f

print("🕳️ Escaping Local Minima")
print("=" * 55)

# Find actual minima numerically
print("\n📍 Function Analysis:")
xs = np.linspace(-3, 3, 1000)
ys = [f(x) for x in xs]
local_min = xs[np.argmin(ys[:500])]  # Left half
global_min = xs[np.argmin(ys)]
print(f"   Local minimum: x ≈ {local_min:.2f}, f(x) = {f(local_min):.2f}")
print(f"   Global minimum: x ≈ {global_min:.2f}, f(x) = {f(global_min):.2f}")

# Test different methods
print("\n🧪 Experiment Results (starting at x = -3):")
print("-" * 55)

# Basic GD
x1, f1 = gd_basic(-3)
print(f"   Basic GD:      x = {x1:.4f}, f(x) = {f1:.4f}", 
      "✅ Global!" if x1 > 0 else "❌ Stuck local")

# Momentum
x2, f2 = gd_momentum(-3)
print(f"   With Momentum: x = {x2:.4f}, f(x) = {f2:.4f}",
      "✅ Global!" if x2 > 0 else "❌ Stuck local")

# Random restarts
x3, f3 = random_restarts(10)
print(f"   Random Restart: x = {x3:.4f}, f(x) = {f3:.4f}",
      "✅ Global!" if x3 > 0 else "❌ Stuck local")

# Starting position analysis
print("\n📊 Impact of Starting Position:")
print(f"   {'Start':<8} {'Basic GD':<15} {'Momentum':<15}")
print("-" * 40)
for start in [-3, -1, 0, 1, 3]:
    x_basic, f_basic = gd_basic(start)
    x_mom, f_mom = gd_momentum(start)
    print(f"   {start:<8} x={x_basic:>6.2f} ({f_basic:>6.2f})   x={x_mom:>6.2f} ({f_mom:>6.2f})")

print("\n💡 Key Insights:")
print("   - Starting position greatly affects which minimum you find")
print("   - Momentum helps escape shallow local minima")
print("   - Random restarts are simple but effective")
print("   - Real neural networks use all of these tricks!")
Real-World Insight: Deep learning uses all these tricks! Random initialization, momentum (in Adam optimizer), and sometimes explicit random restarts. That’s why training the same model twice can give different results!

Exercise 4: Mini-Batch Gradient Descent 📦

Implement mini-batch training on a larger dataset:
import numpy as np

# Generate a larger dataset
np.random.seed(42)
n_samples = 1000
X = np.random.randn(n_samples, 3)  # 3 features
true_w = np.array([2.0, -1.5, 0.5])
y = X @ true_w + np.random.randn(n_samples) * 0.5

# TODO:
# 1. Implement full-batch gradient descent
# 2. Implement mini-batch GD with batch_size=32
# 3. Compare convergence speed (iterations to reach loss < 1.0)
# 4. Compare wall-clock time (which is faster per epoch?)
import numpy as np
import time

np.random.seed(42)
n_samples = 1000
X = np.random.randn(n_samples, 3)
true_w = np.array([2.0, -1.5, 0.5])
y = X @ true_w + np.random.randn(n_samples) * 0.5

def loss(w, X, y):
    return np.mean((X @ w - y) ** 2)

def gradient_full(w, X, y):
    """Gradient using all data"""
    errors = X @ w - y
    return 2 * X.T @ errors / len(y)

def gradient_batch(w, X_batch, y_batch):
    """Gradient using mini-batch"""
    errors = X_batch @ w - y_batch
    return 2 * X_batch.T @ errors / len(y_batch)

def full_batch_gd(X, y, lr=0.01, epochs=100):
    w = np.zeros(3)
    history = []
    
    start = time.time()
    for epoch in range(epochs):
        grad = gradient_full(w, X, y)
        w = w - lr * grad
        history.append(loss(w, X, y))
    elapsed = time.time() - start
    
    return w, history, elapsed

def mini_batch_gd(X, y, batch_size=32, lr=0.01, epochs=100):
    w = np.zeros(3)
    history = []
    n = len(y)
    
    start = time.time()
    for epoch in range(epochs):
        # Shuffle data
        indices = np.random.permutation(n)
        X_shuf = X[indices]
        y_shuf = y[indices]
        
        # Process mini-batches
        for i in range(0, n, batch_size):
            X_batch = X_shuf[i:i+batch_size]
            y_batch = y_shuf[i:i+batch_size]
            grad = gradient_batch(w, X_batch, y_batch)
            w = w - lr * grad
        
        history.append(loss(w, X, y))
    elapsed = time.time() - start
    
    return w, history, elapsed

print("📦 Mini-Batch vs Full-Batch Gradient Descent")
print("=" * 55)

# Run experiments
w_full, hist_full, time_full = full_batch_gd(X, y)
w_mini, hist_mini, time_mini = mini_batch_gd(X, y, batch_size=32)

print(f"\n📊 Results after 100 epochs:")
print("-" * 55)
print(f"{'Method':<15} {'Final Loss':<12} {'Time (s)':<10} {'Weights'}")
print("-" * 55)
print(f"{'Full Batch':<15} {hist_full[-1]:<12.4f} {time_full:<10.4f} {w_full.round(3)}")
print(f"{'Mini-Batch':<15} {hist_mini[-1]:<12.4f} {time_mini:<10.4f} {w_mini.round(3)}")
print(f"{'True Weights':<15} {'-':<12} {'-':<10} {true_w}")

# Convergence analysis
def epochs_to_threshold(history, threshold=1.0):
    for i, l in enumerate(history):
        if l < threshold:
            return i + 1
    return len(history)

e_full = epochs_to_threshold(hist_full)
e_mini = epochs_to_threshold(hist_mini)

print(f"\n⏱️ Epochs to reach loss < 1.0:")
print(f"   Full Batch: {e_full} epochs")
print(f"   Mini-Batch: {e_mini} epochs")

# Per-epoch analysis
print(f"\n📈 Loss at key epochs:")
print(f"   {'Epoch':<8} {'Full Batch':<15} {'Mini-Batch':<15}")
print("-" * 40)
for e in [1, 5, 10, 25, 50, 100]:
    print(f"   {e:<8} {hist_full[e-1]:<15.4f} {hist_mini[e-1]:<15.4f}")

print("\n💡 Key Insights:")
print("   - Mini-batch has noisier updates but often converges faster")
print("   - Mini-batch uses less memory (important for big data)")
print("   - The noise in mini-batch can help escape local minima")
print("   - This is why batch_size is a key hyperparameter!")
Real-World Insight: GPT-4 and other large language models are trained exclusively with mini-batch gradient descent. A single batch might process hundreds of examples, but it’s still a tiny fraction of the trillions of training tokens!

🎯 Optimizer Selection Guide: Which One Should You Use?

Common Mistake: Many beginners stick with vanilla SGD or randomly pick Adam. Understanding when to use which optimizer can save hours of training time!

Decision Flowchart

╔══════════════════════════════════════════╗
║ What's your use case?                         ║
╚══════════════════════════════════════════╝

     ┌─────┴───────────────────────────┐
     │                                   │
  Just starting /            Computer Vision /
  Quick prototype            Image models
     │                                   │
     ↓                                   ↓
  ┌─────────┐                    ┌─────────────┐
  │  Adam    │                    │ SGD + Momentum│
  │ lr=1e-3 │                    │    + Nesterov │
  └─────────┘                    └─────────────┘
     │                                   │
     │                                   │
  NLP / Transformers          Fine-tuning 
     │                       pre-trained
     ↓                                   │
  ┌─────────────┐                    │
  │ AdamW        │                    ↓
  │ (Adam +      │            ┌─────────────┐
  │  weight decay)│            │   AdamW     │
  └─────────────┘            │ lr=1e-5 to  │
                               │   1e-4      │
                               └─────────────┘

Optimizer Comparison Table

OptimizerBest ForLearning RateProsCons
SGDUnderstanding, simple models0.01-0.1Simple, predictableSlow, gets stuck
SGD + MomentumCV models (ResNet, etc.)0.01-0.1Fast, smoothNeeds tuning
AdamMost cases, quick prototypes1e-3 to 1e-4Works out of boxCan overfit
AdamWTransformers, NLP, fine-tuning1e-4 to 1e-5Best for LLMsSlightly slower
RMSpropRNNs, unstable gradients1e-3 to 1e-4Handles varying gradientsLess common now

Learning Rate Guidelines

# Starting learning rates by model type
lr_guidelines = {
    'linear_regression': 0.01,
    'small_neural_net': 0.001,
    'cnn_from_scratch': 0.01,  # with SGD
    'cnn_transfer_learning': 0.0001,
    'transformer_training': 0.0001,
    'llm_fine_tuning': 0.00001,  # Very small!
    'stable_diffusion': 0.00001,
}

print("Always start with these, then adjust based on:")
print("  - Loss not decreasing? Try 10x smaller")
print("  - Loss decreasing too slowly? Try 2-5x larger")
print("  - Loss oscillating? Try 2-10x smaller")

🚨 Real-World Challenge: Messy Training Data

Production Reality: Training data in the real world has issues that can completely break gradient descent!

Handling Noisy Labels

import numpy as np

# Simulated noisy data (10% of labels are wrong)
np.random.seed(42)
X = np.random.randn(1000, 10)
y_true = (X[:, 0] + X[:, 1] > 0).astype(float)
y_noisy = y_true.copy()
noise_mask = np.random.rand(1000) < 0.10  # 10% noise
y_noisy[noise_mask] = 1 - y_noisy[noise_mask]  # Flip labels

print(f"Noisy labels: {noise_mask.sum()} out of {len(y_true)}")

# Strategies to handle noisy labels:
# 1. Label Smoothing
def label_smoothing(y, epsilon=0.1):
    """Soften hard labels to reduce overconfidence on wrong labels."""
    return y * (1 - epsilon) + 0.5 * epsilon

# 2. Early Stopping (networks learn clean patterns first)
# 3. Mixup / Data Augmentation
# 4. Confident Learning (identify and remove noisy samples)

Handling Missing Features

# Real data often has missing values
X_with_missing = X.copy()
missing_mask = np.random.rand(*X.shape) < 0.05  # 5% missing
X_with_missing[missing_mask] = np.nan

print(f"Missing values: {missing_mask.sum()} out of {X.size}")

# Strategies:
# 1. Imputation before training
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X_with_missing)

# 2. Mask and learn (for neural nets)
# Replace NaN with 0 and add a "missing indicator" feature
X_masked = np.nan_to_num(X_with_missing, nan=0)
missing_indicator = np.isnan(X_with_missing).astype(float)
X_augmented = np.hstack([X_masked, missing_indicator])

Handling Class Imbalance

# Example: Fraud detection (99% normal, 1% fraud)
y_imbalanced = np.zeros(10000)
y_imbalanced[:100] = 1  # Only 1% positive

print(f"Class distribution: {np.bincount(y_imbalanced.astype(int))}")

# Strategies:
# 1. Class Weights in Loss
class_weights = {0: 1.0, 1: 99.0}  # Weight minority class higher

# 2. Oversampling (SMOTE)
# 3. Undersampling
# 4. Focal Loss (reduces weight on easy examples)

def focal_loss(y_true, y_pred, gamma=2.0):
    """Focus on hard examples, ignore easy ones."""
    pt = y_true * y_pred + (1 - y_true) * (1 - y_pred)
    focal_weight = (1 - pt) ** gamma
    return -focal_weight * np.log(pt + 1e-8)
Rule of Thumb for Messy Data:
  1. Always visualize your data before training
  2. Check for outliers - they can dominate gradients
  3. Monitor gradient norms - sudden spikes indicate problems
  4. Use gradient clipping as a safety net

Key Takeaways

Gradient descent = iterative optimization algorithm
Follow gradient downhill = move opposite to gradient
Learning rate = critical hyperparameter
Mini-batch = best practice for large datasets
Powers all ML = from linear regression to GPT

Learning Rate Schedulers: Advanced Techniques

Pro tip: The best learning rate changes during training! Start high, then decrease.
SchedulerFormulaWhen to Use
Step Decayαt=α0γt/s\alpha_t = \alpha_0 \cdot \gamma^{\lfloor t/s \rfloor}Simple, predictable
Exponentialαt=α0eλt\alpha_t = \alpha_0 \cdot e^{-\lambda t}Smooth decay
Cosine Annealingαt=αmin+12(α0αmin)(1+cos(tπT))\alpha_t = \alpha_{min} + \frac{1}{2}(\alpha_0 - \alpha_{min})(1 + \cos(\frac{t\pi}{T}))State-of-the-art
Warmup + DecayLinear increase then decreaseTransformers, LLMs
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR, OneCycleLR

# Example: Using schedulers in PyTorch
model = torch.nn.Linear(10, 1)
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Option 1: Step decay (multiply by 0.1 every 10 epochs)
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)

# Option 2: Cosine annealing (used by GPT, BERT)
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# Option 3: One Cycle (super-convergence)
scheduler = OneCycleLR(optimizer, max_lr=0.1, total_steps=1000)

# Training loop
for epoch in range(100):
    # ... training code ...
    optimizer.step()
    scheduler.step()  # Update learning rate

When Training Gets Stuck: Debugging Checklist

  1. Learning rate too high? Try 10x smaller
  2. Learning rate too low? Try 10x larger
  3. Data issue? Check for NaN/Inf values
  4. Bug in loss function? Verify on toy data
  1. Reduce learning rate by 2-10x
  2. Add gradient clipping
  3. Check for exploding gradients
  4. Increase batch size for smoother gradients
  1. Use learning rate scheduler
  2. Add momentum or switch to Adam
  3. Check if you’ve converged (that’s good!)
  4. Try data augmentation
  1. Overfitting - add regularization
  2. Reduce model complexity
  3. Add dropout
  4. Get more data

What’s Next?

Gradient descent is powerful, but basic. Can we do better? Can we converge faster? Can we escape local minima? Yes! Advanced optimization techniques like Momentum, Adam, and RMSprop!

Next: Optimization Techniques

Learn the optimizers that power modern deep learning

Interview Deep-Dive

Strong Answer:
  • The most practical technique is the learning rate range test (also called the LR finder), popularized by Leslie Smith. You start with a very small learning rate (say 1e-7), train for a few hundred steps, and exponentially increase the learning rate at each step up to a large value (say 10). Plot loss versus learning rate on a log scale. The optimal learning rate is typically at the steepest descent of the curve — just before the loss starts increasing or oscillating. In my experience, this takes about 5 minutes of compute and saves hours of trial-and-error.
  • A rough heuristic that works surprisingly often for Adam: start with 3e-4. For SGD with momentum, start with 0.1 and decay. For fine-tuning pretrained models, use 10-100x smaller than the original training rate, typically 1e-5 to 5e-5.
  • The deeper understanding: the maximum stable learning rate is related to the curvature of the loss landscape. Specifically, for a quadratic loss with maximum eigenvalue of the Hessian equal to lambda_max, the learning rate must be less than 2/lambda_max for gradient descent to converge. In practice, you do not compute the Hessian, but the LR finder empirically discovers this boundary.
  • For production systems, I use learning rate schedules rather than a single fixed rate. Warmup (start small, ramp up over 1000-5000 steps) stabilizes early training when the loss surface is poorly conditioned. Then cosine decay or linear decay reduces the rate as training progresses. The combination of warmup + cosine decay has become the default for transformer training (GPT, BERT, and most LLMs use this pattern).
Follow-up: Why does learning rate warmup help stabilize training? What would go wrong without it?At the start of training, weights are randomly initialized and the loss surface is typically poorly conditioned — some directions have very high curvature while others are nearly flat. A large learning rate at this stage causes the optimizer to take huge steps along high-curvature directions, which can send weights into unstable regions or even cause immediate divergence. This is especially pronounced with Adam, because the running averages of gradients (m) and squared gradients (v) are initialized to zero. In the first few steps, the bias correction amplifies the update magnitude, and combining this with a large learning rate is a recipe for instability. Warmup gives the optimizer time to build up accurate gradient statistics and for the loss surface to “settle” into a more well-conditioned region. Without warmup, I have seen large transformer models diverge within the first 100 steps, while the same model with 2000 steps of linear warmup trains perfectly. Google’s original BERT paper used 10,000 warmup steps out of 1,000,000 total — just 1% of training, but essential for stability.
Strong Answer:
  • Batch gradient descent computes the gradient over the entire training set before making one update. The gradient is the exact average gradient, so updates are smooth and stable. But for a dataset of N samples, every single step costs N forward-backward passes. For ImageNet with 1.2 million images, that is prohibitively slow — one step might take hours.
  • Stochastic gradient descent (SGD with batch size 1) updates after every single sample. This is extremely noisy — the gradient from one sample is a very rough approximation of the true gradient. But you make N updates per epoch instead of 1, so convergence in wall-clock time is often faster. The noise is actually beneficial: it helps escape sharp local minima and acts as an implicit regularizer.
  • Mini-batch SGD is the practical sweet spot. You compute the gradient over B samples (typically 16-512) and update. The gradient variance decreases as 1/B, so larger batches give smoother updates. But there are diminishing returns: doubling the batch size halves the variance but doubles the compute per step. The optimal batch size balances gradient quality with compute efficiency.
  • A critical practical consideration: GPU utilization. GPUs are massively parallel — processing 32 samples costs barely more than processing 1 because the matrix multiplications are parallelized. So increasing batch size from 1 to 32 gives 32x more gradient information at maybe 2x the wall-clock time. Beyond some threshold (often 256-1024 depending on model and hardware), the GPU saturates and larger batches give diminishing throughput.
  • The generalization trade-off: Keskar et al. (2017) showed that large-batch training tends to converge to sharp minima that generalize poorly, while small-batch training finds flatter minima that generalize better. This has been partially addressed by learning rate scaling rules (linear scaling by Goyal et al. 2017: multiply LR by batch_size/base_batch_size) and LARS/LAMB optimizers designed for very large batch training.
Follow-up: If noise in SGD helps escape sharp minima, why not just add artificial noise to batch gradient descent?People do exactly this, and it is called Langevin dynamics or noisy gradient descent. You add Gaussian noise to the gradient: g_noisy = g_true + N(0, sigma^2). However, the noise in mini-batch SGD has special structure — it is not isotropic random noise. The mini-batch gradient noise is correlated with the loss landscape geometry. Specifically, the noise covariance is proportional to the gradient covariance across samples, which tends to be larger in sharp directions of the loss surface. This means SGD noise naturally provides more perturbation in precisely the directions where the landscape is most problematic. Artificial isotropic noise does not have this property. That said, for certain theoretical analyses and Bayesian deep learning (e.g., stochastic gradient Langevin dynamics), adding calibrated noise to the gradient is a principled technique for approximate posterior sampling.
Strong Answer:
  • First, I would challenge the assumption. Equal train and validation accuracy at 85% means the model is not overfitting — it is underfitting. The issue might not be a local minimum at all. It could be that the model lacks capacity (too small), the features are insufficiently expressive, or the data itself has an 85% accuracy ceiling (label noise, inherently ambiguous examples).
  • To diagnose whether optimization is the bottleneck versus model capacity: overfit a tiny subset. Take 100 examples and train until the loss is near zero. If the model can memorize 100 examples perfectly, then capacity is not the issue — the problem is optimization or data. If it cannot even memorize 100 examples, the architecture needs to change.
  • If the model overfits the tiny subset but plateaus at 85% on the full dataset, optimization is likely the bottleneck. I would check: (1) gradient norms — if they are near zero, you are at a critical point. (2) Loss landscape visualization — project the loss along random directions to see if you are in a flat region versus a true minimum. (3) Try different learning rates — a larger rate might push you out of a local minimum. (4) Switch optimizers — Adam with its adaptive rates might navigate the landscape differently than SGD.
  • Specific interventions for escaping local minima: learning rate restarts (cyclical learning rates or warm restarts, as in SGDR by Loshchilov and Hutter), which periodically spike the learning rate to “shake” the optimizer out of basins. Increasing batch size during training (reverse of the typical schedule) can also sharpen the gradient signal and push toward different basins. Finally, adding noise to parameters (shaking the weights directly) or using stochastic weight averaging (SWA) to average weights across multiple points in the optimization trajectory can find wider, more generalizable basins.
  • In my experience, the most common fix for plateauing models is not fancy optimization tricks but rather: better data augmentation, larger models, or longer training. The optimization landscape of modern neural networks is surprisingly benign — most local minima have similar loss values. Plateaus are more often caused by data or architecture limitations than by optimization failure.
Follow-up: You mentioned stochastic weight averaging (SWA). How does averaging weights from different training iterations improve generalization, and is this mathematically justified?SWA works by collecting weight snapshots during training (typically during the last portion with a cyclical or constant learning rate) and averaging them. Geometrically, the individual snapshots may be in different local minima along a flat basin, and averaging them tends to land at a point near the center of the basin — which is flatter and broader. Flatter minima correlate with better generalization because small perturbations to the weights (which happen due to train/test distribution shift) cause smaller changes in predictions. The mathematical justification comes from the connection between flatness and the Hessian: at the SWA solution, the Hessian eigenvalues tend to be smaller than at any individual snapshot, meaning the loss surface is more gently curved. Practically, SWA often gives 1-2% accuracy improvement on ImageNet for essentially zero extra training cost — you are just averaging checkpoints you already computed.
Strong Answer:
  • The linear scaling rule says: if you multiply batch size by k, multiply learning rate by k. The intuition is that with a batch k times larger, your gradient estimate has k times less variance (standard error scales as 1/sqrt(B), but you are taking one step instead of k steps, so the net effect on expected weight change per epoch should be preserved).
  • More precisely: with batch size B and learning rate eta, the expected weight change per epoch is proportional to eta * (N/B) * g_bar, where N is dataset size, N/B is steps per epoch, and g_bar is the average gradient. If you double B to 2B, steps per epoch halves to N/(2B), so to maintain the same expected weight change, you double eta to 2*eta.
  • This works well up to a point. Goyal et al. (2017) at Facebook successfully trained ImageNet with batch sizes up to 8192 using linear scaling. But beyond a critical batch size (which is problem-dependent), the linear scaling rule breaks down. With extremely large batches, you need sublinear scaling (like square root scaling: LR proportional to sqrt(B)), or specialized optimizers like LARS (layer-wise adaptive rate scaling) and LAMB.
  • The reason it breaks down: very large batches reduce gradient noise so much that the optimizer converges to sharp minima that generalize poorly. The noise from small batches is not just an annoyance — it serves as a regularizer. Killing that noise with huge batches can hurt test accuracy even when training accuracy improves.
  • A practical consideration: the learning rate warmup period should also scale with batch size. Larger batches need longer warmup because the initial gradient statistics are noisier relative to the larger step sizes. The BERT training recipe uses warmup proportional to total training steps, which implicitly accounts for batch size.
Follow-up: Google and others train with batch sizes of 32K or even 64K. How do they maintain generalization quality at these extreme batch sizes?Several techniques work in concert. First, LARS and LAMB optimizers compute separate learning rates for each layer, normalized by the layer’s weight norm divided by its gradient norm. This prevents any single layer from taking a disproportionately large step, which is the main failure mode of large-batch training with a global learning rate. Second, longer warmup — often 5-10% of total training. Third, explicit regularization to compensate for the lost stochastic noise: stronger weight decay, dropout, or label smoothing. Fourth, gradient accumulation as an alternative: instead of a true 64K batch requiring 64K samples in GPU memory simultaneously, you accumulate gradients over multiple smaller forward-backward passes before updating. This gives the same mathematical effect as a large batch but fits in limited GPU memory. The trade-off is wall-clock time, since you cannot parallelize the accumulation steps.