Optimization Techniques

Your Challenge: The Valley of Deceit

Standard Gradient Descent is like walking downhill blindfolded. It works great on a smooth, simple hill. But real-world loss landscapes are treacherous.
  • Local Minima: Small dips that look like the bottom but aren’t.
  • Saddle Points: Flat regions where the gradient is nearly zero, so progress stalls.
  • Ravines: Narrow valleys with steep walls, where you bounce from side to side instead of moving along the floor.
Your Goal: Navigate this treacherous terrain to find the true global minimum, fast. You need better equipment than just “walking downhill”. You need Momentum and Adaptive Steps.

Momentum: The Heavy Ball

The Intuition

Imagine rolling a ping-pong ball down a bumpy hill. It gets stuck in every little pothole (Local Minimum). Now imagine rolling a heavy bowling ball.
  • It gains speed.
  • When it hits a small pothole, its momentum carries it right through.
  • It eventually settles in the deepest valley.
[Figure: the heavy bowling ball vs. ping-pong ball momentum analogy]

The Math

Instead of just following the gradient, we keep a "velocity" v that accumulates speed.

\begin{align}
v_{new} &= \beta \cdot v_{old} + (1 - \beta) \cdot \nabla f(x) \\
x_{new} &= x_{old} - \alpha \cdot v_{new}
\end{align}

  • β: Friction (usually 0.9). The ball retains 90% of its previous speed.
  • v: Velocity.
  • α: Learning rate (step size).

The Code

import numpy as np

# A quartic with a shallow local minimum near x ≈ -1.7
# and the global minimum near x ≈ 2.2
def f(x): return 0.25*x**4 - 2*x**2 - 2*x
def grad(x): return x**3 - 4*x - 2

# 1. Standard SGD (gets stuck)
x = -3.0
lr = 0.1
for _ in range(50):
    x = x - lr * grad(x)
print(f"SGD stuck at x={x:.2f}")  # ~ -1.7 (local minimum)

# 2. Momentum (escapes!)
x = -3.0
v = 0.0
beta = 0.9
for _ in range(50):
    v = beta * v + (1 - beta) * grad(x)
    x = x - lr * v
print(f"Momentum reached x={x:.2f}") # ~ 2.2 (global minimum)
Key Insight: Momentum helps you blast through small traps and speed up on flat surfaces!

RMSprop & Adam: Adaptive Shoes

The Problem with Ravines

Imagine a narrow ravine.
  • Steep walls (high gradient in the y direction).
  • Gentle slope towards the sea (low gradient in the x direction).
If you take big steps, you bounce off the walls (y) and never move forward (x). If you take small steps, you move forward (x) but it takes forever. Solution: wear Adaptive Shoes.
  • If the ground is steep (y), take tiny steps.
  • If the ground is flat (x), take huge steps.
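This per-direction scaling is exactly what RMSprop does: it keeps a running average of each coordinate's squared gradient and divides the step by its square root. A minimal sketch of the update (standard RMSprop; the helper name and default values here are just for illustration):

import numpy as np

def rmsprop_step(x, grad, s, lr=0.01, rho=0.9, eps=1e-8):
    # s is a running average of squared gradients: a memory of how steep
    # each direction has been so far
    s = rho * s + (1 - rho) * grad**2
    # Steep directions (large s) get tiny steps; flat directions get big ones
    x = x - lr * grad / (np.sqrt(s) + eps)
    return x, s

Steep directions accumulate a large s, so their effective step lr/√s shrinks; flat directions keep taking large steps - exactly the "adaptive shoes" behaviour described above.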

Adam (Adaptive Moment Estimation)

Adam combines both ideas:
  1. Momentum: Keep moving forward (Velocity).
  2. RMSprop: Adapt step size based on terrain steepness (Variance).
It is the “Gold Standard” optimizer in Deep Learning today.
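Written out, the standard Adam update (the same equations you will implement from scratch in Exercise 4 below) is:

\begin{align}
m_t &= \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot \nabla f(x_t) \\
v_t &= \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot \big(\nabla f(x_t)\big)^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
x_{t+1} &= x_t - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{align}

with the usual defaults α = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 1e-8. The bias-correction terms (1 - β₁ᵗ) and (1 - β₂ᵗ) matter mostly in the first few steps, when m and v are still close to their zero initialization.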

The Code (Using PyTorch)

You rarely implement Adam from scratch. You use a library.
import torch
import torch.nn as nn
import torch.optim as optim

# 1. Define your model
model = nn.Linear(10, 1)

# 2. Choose your optimizer
# SGD
optimizer_sgd = optim.SGD(model.parameters(), lr=0.01)

# Momentum
optimizer_mom = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam (The Best)
optimizer_adam = optim.Adam(model.parameters(), lr=0.001)

# 3. Training loop (one step, with dummy data)
inputs = torch.randn(32, 10)
targets = torch.randn(32, 1)

loss = nn.MSELoss()(model(inputs), targets)
optimizer_adam.zero_grad()  # clear old gradients
loss.backward()             # compute new gradients
optimizer_adam.step()       # this applies the math!

Comparison: Who Wins?

Optimizer   Analogy         Best For
SGD         Drunk walker    Simple problems
Momentum    Bowling ball    Noisy gradients, local minima
Adam        Smart robot     Almost everything (default choice)

Visual Comparison

[Figure: optimizer paths compared on a complex loss surface]
If we race them on a complex terrain:
  1. SGD: Stumbles, gets stuck.
  2. Momentum: Overshoots but eventually settles.
  3. Adam: Beelines straight for the goal.

Practice Exercise: Escape the Trap

The Scenario

You are training a model that keeps getting stuck at 80% accuracy.
  • Loss isn’t going down.
  • Gradients are small but not zero.
Diagnosis: You are likely in a Saddle Point or Local Minimum. Your Task: Switch from SGD to Adam and observe the difference.
# Pseudo-code for your experiment
model = MyNeuralNet()
criterion = nn.MSELoss()

# Experiment A: SGD
opt_a = optim.SGD(model.parameters(), lr=0.01)
train(model, opt_a) # Result: 80% acc

# Experiment B: Adam
opt_b = optim.Adam(model.parameters(), lr=0.001)
train(model, opt_b) # Result: 95% acc!
Takeaway: Changing the optimizer is often the easiest way to improve your model!

🎯 Practice Exercises & Real-World Applications

Challenge yourself! These exercises let you experience why modern optimizers matter.

Exercise 1: Optimizer Shootout 🏁

Race different optimizers on the same problem:
import numpy as np

# Rosenbrock function - a classic optimization challenge
# f(x, y) = (1-x)² + 100(y-x²)²
# Minimum at (1, 1)

# Starting point: (-2, 2)
# This is a "banana-shaped" valley - hard for basic GD!

# TODO:
# 1. Implement SGD, Momentum, and Adam
# 2. Run each for 5000 steps
# 3. Compare: distance to optimum, path length, convergence speed
import numpy as np

def rosenbrock(x, y):
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(x, y):
    dx = -2*(1-x) - 400*x*(y - x**2)
    dy = 200*(y - x**2)
    return np.array([dx, dy])

def run_sgd(start, lr=0.001, steps=5000):
    pos = np.array(start, dtype=float)
    history = [pos.copy()]
    for _ in range(steps):
        grad = rosenbrock_grad(*pos)
        pos -= lr * grad
        history.append(pos.copy())
    return history

def run_momentum(start, lr=0.001, beta=0.9, steps=5000):
    pos = np.array(start, dtype=float)
    velocity = np.zeros(2)
    history = [pos.copy()]
    for _ in range(steps):
        grad = rosenbrock_grad(*pos)
        velocity = beta * velocity + grad
        pos -= lr * velocity
        history.append(pos.copy())
    return history

def run_adam(start, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    pos = np.array(start, dtype=float)
    m = np.zeros(2)
    v = np.zeros(2)
    history = [pos.copy()]
    for t in range(1, steps + 1):
        grad = rosenbrock_grad(*pos)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        pos -= lr * m_hat / (np.sqrt(v_hat) + eps)
        history.append(pos.copy())
    return history

print("🏁 Optimizer Shootout: Rosenbrock Function")
print("=" * 55)
print("Target: (1, 1) | Start: (-2, 2)")

start = [-2.0, 2.0]
optimal = np.array([1.0, 1.0])

histories = {
    "SGD": run_sgd(start),
    "Momentum": run_momentum(start),
    "Adam": run_adam(start)
}

print("\n📊 Results after 5000 steps:")
print("-" * 55)
print(f"{'Optimizer':<12} {'Final Position':<20} {'Distance':<10} {'Final Loss'}")
print("-" * 55)

for name, hist in histories.items():
    final = hist[-1]
    dist = np.linalg.norm(final - optimal)
    loss = rosenbrock(*final)
    print(f"{name:<12} ({final[0]:>7.4f}, {final[1]:>7.4f})   {dist:<10.6f} {loss:.6f}")

# Convergence speed analysis
print("\n⏱️ Steps to reach distance < 0.1 from optimum:")
for name, hist in histories.items():
    for i, pos in enumerate(hist):
        if np.linalg.norm(pos - optimal) < 0.1:
            print(f"   {name}: {i} steps")
            break
    else:
        print(f"   {name}: Never reached (final dist: {np.linalg.norm(hist[-1] - optimal):.4f})")

# Path analysis
print("\n📈 Path Length (total distance traveled):")
for name, hist in histories.items():
    path_length = sum(np.linalg.norm(np.array(hist[i+1]) - np.array(hist[i])) 
                     for i in range(min(1000, len(hist)-1)))
    print(f"   {name}: {path_length:.2f}")

print("\n💡 Key Insights:")
print("   - SGD gets stuck in the curved valley")
print("   - Momentum helps but can overshoot")
print("   - Adam adapts step sizes and finds the optimum!")
Real-World Insight: The Rosenbrock function is a classic benchmark. Real neural network loss landscapes are even more complex - that’s why Adam is the default optimizer in most frameworks!

Exercise 2: Learning Rate Scheduling 📅

Implement and compare learning rate schedules:
import numpy as np

def quadratic(x):
    return (x - 5)**2

def grad(x):
    return 2*(x - 5)

# TODO:
# 1. Implement constant LR (lr = 0.1)
# 2. Implement step decay (halve every 100 steps)
# 3. Implement exponential decay (lr = lr0 * 0.99^step)
# 4. Implement cosine annealing
# 5. Compare convergence and stability
import numpy as np

def f(x):
    return (x - 5)**2

def grad(x):
    return 2*(x - 5)

def train_with_schedule(schedule_fn, steps=500, x0=0):
    x = x0
    history = [(x, f(x))]
    for step in range(steps):
        lr = schedule_fn(step)
        x = x - lr * grad(x)
        history.append((x, f(x)))
    return history

# Learning rate schedules
def constant(step, lr0=0.1):
    return lr0

def step_decay(step, lr0=0.5, decay_rate=0.5, decay_steps=100):
    return lr0 * (decay_rate ** (step // decay_steps))

def exponential(step, lr0=0.5, decay=0.99):
    return lr0 * (decay ** step)

def cosine_annealing(step, lr0=0.5, T=500):
    return lr0 * 0.5 * (1 + np.cos(np.pi * step / T))

def warmup_cosine(step, lr0=0.5, warmup_steps=50, T=500):
    if step < warmup_steps:
        return lr0 * step / warmup_steps
    return lr0 * 0.5 * (1 + np.cos(np.pi * (step - warmup_steps) / (T - warmup_steps)))

print("📅 Learning Rate Schedule Comparison")
print("=" * 55)

schedules = {
    "Constant (0.1)": lambda s: constant(s),
    "Step Decay": lambda s: step_decay(s),
    "Exponential": lambda s: exponential(s),
    "Cosine Annealing": lambda s: cosine_annealing(s),
    "Warmup + Cosine": lambda s: warmup_cosine(s),
}

print("\n📊 Training Results (500 steps, target x=5):")
print("-" * 55)
print(f"{'Schedule':<20} {'Final x':<12} {'Final Loss':<12} {'Converged at'}")
print("-" * 55)

for name, sched in schedules.items():
    hist = train_with_schedule(sched)
    final_x, final_loss = hist[-1]
    
    # Find when it first got close
    converge_step = None
    for i, (x, loss) in enumerate(hist):
        if abs(x - 5) < 0.01:
            converge_step = i
            break
    
    conv_str = f"step {converge_step}" if converge_step else "Never"
    print(f"{name:<20} {final_x:<12.6f} {final_loss:<12.8f} {conv_str}")

# Learning rate visualization
print("\n📈 Learning Rate Over Time:")
print("   Step | Constant | StepDecay | Exponent | Cosine  ")
print("   -----|----------|-----------|----------|--------")
for step in [0, 50, 100, 200, 300, 400, 499]:
    c = constant(step)
    s = step_decay(step)
    e = exponential(step)
    cos = cosine_annealing(step)
    print(f"   {step:4} | {c:8.4f} | {s:9.4f} | {e:8.4f} | {cos:8.4f}")

print("\n💡 Key Insights:")
print("   - Constant LR: Simple but may oscillate near optimum")
print("   - Step decay: Sudden drops can cause instability")
print("   - Exponential: Smooth decay, widely used")
print("   - Cosine: Modern favorite, smooth and goes to zero")
print("   - Warmup: Helps with unstable early gradients")
Real-World Insight: BERT, GPT, and most modern transformers use warmup + cosine (or linear) decay. The warmup phase is crucial for training stability with adaptive optimizers!
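In PyTorch you rarely hand-roll the schedule inside the loop: torch.optim.lr_scheduler.LambdaLR can express warmup + cosine in a few lines. A minimal sketch (the warmup length and total step count are illustrative values, not a recommendation):

import math
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)
optimizer = optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps, total_steps = 500, 10_000  # illustrative values

def lr_lambda(step):
    # Linear warmup, then cosine decay to zero (returns a multiplier on the base lr)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# In the training loop, call scheduler.step() once after each optimizer.step()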

Exercise 3: Batch Size Trade-offs ⚖️

Explore the relationship between batch size and training:
import numpy as np

# Generate regression data
np.random.seed(42)
n = 10000
X = np.random.randn(n, 5)
true_w = np.array([1, -2, 0.5, 3, -1])
y = X @ true_w + np.random.randn(n) * 0.5

# TODO:
# 1. Train with batch sizes: 1, 16, 64, 256, full
# 2. For each: measure steps to convergence, final accuracy, variance in updates
# 3. Plot the gradient variance for different batch sizes
# 4. Find the "sweet spot" batch size for this problem
import numpy as np

np.random.seed(42)
n = 10000
X = np.random.randn(n, 5)
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w + np.random.randn(n) * 0.5

def loss(w):
    return np.mean((X @ w - y) ** 2)

def gradient_full(w):
    return 2 * X.T @ (X @ w - y) / n

def gradient_batch(w, batch_idx):
    X_b = X[batch_idx]
    y_b = y[batch_idx]
    return 2 * X_b.T @ (X_b @ w - y_b) / len(batch_idx)

def train(batch_size, lr=0.01, max_epochs=50):
    w = np.zeros(5)
    losses = []
    grad_variances = []
    
    for epoch in range(max_epochs):
        if batch_size >= n:
            grad = gradient_full(w)
            w = w - lr * grad
            grad_variances.append(0)  # No variance in full batch
        else:
            indices = np.random.permutation(n)
            epoch_grads = []
            for i in range(0, n, batch_size):
                batch_idx = indices[i:i+batch_size]
                grad = gradient_batch(w, batch_idx)
                epoch_grads.append(grad)
                w = w - lr * grad
            grad_variances.append(np.var(epoch_grads))
        
        losses.append(loss(w))
        if losses[-1] < 0.3:  # Converged
            break
    
    return w, losses, grad_variances

print("⚖️ Batch Size Trade-offs")
print("=" * 55)

batch_sizes = [1, 16, 64, 256, 1024, n]
batch_names = ["1 (SGD)", "16", "64", "256", "1024", "Full"]

print("\n📊 Training Results (max 50 epochs):")
print("-" * 65)
print(f"{'Batch Size':<12} {'Epochs':<8} {'Final Loss':<12} {'Weight Error':<15} {'Avg Grad Var'}")
print("-" * 65)

for bs, name in zip(batch_sizes, batch_names):
    lr = 0.01 if bs < 256 else 0.1  # Larger LR for larger batches
    w, losses, gv = train(bs, lr=lr)
    epochs = len(losses)
    final_loss = losses[-1]
    w_error = np.linalg.norm(w - true_w)
    avg_var = np.mean(gv) if gv and gv[0] > 0 else 0
    
    print(f"{name:<12} {epochs:<8} {final_loss:<12.6f} {w_error:<15.6f} {avg_var:.6f}")

# Gradient variance analysis
print("\n📈 Gradient Variance vs Batch Size:")
print("   (Lower variance = more stable updates)")
for bs, name in zip([1, 16, 64, 256], ["1", "16", "64", "256"]):
    w = np.zeros(5)
    grad_norms = []  # gradient norm for 100 random batches at the same w
    for _ in range(100):
        batch_idx = np.random.choice(n, bs, replace=False)
        grad = gradient_batch(w, batch_idx)
        grad_norms.append(np.linalg.norm(grad))

    norm_variance = np.var(grad_norms)
    bar = "█" * int(norm_variance * 5)
    print(f"   BS={name:>4}: {bar} {norm_variance:.4f}")

print("\n💡 Key Insights:")
print("   - Small batches: High variance but can escape local minima")
print("   - Large batches: Low variance but may need higher LR")
print("   - Sweet spot (16-64): Balance of speed and stability")
print("   - GPUs favor powers of 2 (32, 64, 128, 256)")
Real-World Insight: Research on large-batch training (e.g., the LARS and LAMB optimizers) showed that batch sizes of 32K+ can work when the learning rate is scaled appropriately. That scaling is a big part of why today's largest models can be trained in days rather than months!
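A common heuristic from that line of work is the linear scaling rule: if you multiply the batch size by k, multiply the learning rate by k too (usually combined with a warmup phase). A tiny sketch, with made-up base values:

# Linear scaling rule (heuristic, usually paired with LR warmup)
base_lr = 0.1        # learning rate tuned at the base batch size (illustrative)
base_batch = 256
batch_size = 4096

scaled_lr = base_lr * (batch_size / base_batch)
print(f"Scaled learning rate: {scaled_lr}")  # 1.6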

Exercise 4: Adam from Scratch 🔧

Implement the Adam optimizer and understand each component:
import numpy as np

# Implement Adam optimizer for a 2D problem
def f(x, y):
    return 0.5*x**2 + 5*y**2  # Elliptical bowl

def grad(x, y):
    return np.array([x, 10*y])

# TODO:
# 1. Implement Adam with m (momentum), v (RMSprop), and bias correction
# 2. Compare Adam to vanilla SGD on this problem
# 3. Visualize how m and v evolve during training
# 4. Show why Adam handles the different curvatures in x and y
import numpy as np

def f(x, y):
    return 0.5*x**2 + 5*y**2

def grad(x, y):
    return np.array([x, 10*y])

def sgd(start, lr=0.1, steps=100):
    pos = np.array(start, dtype=float)
    history = [pos.copy()]
    for _ in range(steps):
        g = grad(*pos)
        pos -= lr * g
        history.append(pos.copy())
    return history

def adam(start, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    pos = np.array(start, dtype=float)
    m = np.zeros(2)  # First moment (momentum)
    v = np.zeros(2)  # Second moment (RMSprop)
    
    history = [pos.copy()]
    m_history = [m.copy()]
    v_history = [v.copy()]
    
    for t in range(1, steps + 1):
        g = grad(*pos)
        
        # Update moments
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        
        # Bias correction
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        
        # Update position
        pos -= lr * m_hat / (np.sqrt(v_hat) + eps)
        
        history.append(pos.copy())
        m_history.append(m.copy())
        v_history.append(v.copy())
    
    return history, m_history, v_history

print("🔧 Adam from Scratch")
print("=" * 55)
print("Function: f(x,y) = 0.5x² + 5y² (ellipse, steep in y)")
print("Start: (10, 1)")

start = [10.0, 1.0]

sgd_hist = sgd(start)
adam_hist, m_hist, v_hist = adam(start)

print("\n📊 Comparison (100 steps):")
print("-" * 55)
print(f"{'Step':<6} {'SGD Position':<25} {'Adam Position'}")
print("-" * 55)

for step in [0, 10, 25, 50, 100]:
    sgd_pos = sgd_hist[step]
    adam_pos = adam_hist[step]
    print(f"{step:<6} ({sgd_pos[0]:>8.4f}, {sgd_pos[1]:>8.4f})      ({adam_pos[0]:>8.4f}, {adam_pos[1]:>8.4f})")

print("\n🔬 Adam's Internal State Evolution:")
print(f"{'Step':<6} {'Gradient':<20} {'m (momentum)':<20} {'v (variance)'}")
print("-" * 70)
for step in [1, 5, 10, 25]:
    g = grad(*adam_hist[step-1])
    m = m_hist[step]
    v = v_hist[step]
    print(f"{step:<6} ({g[0]:>6.2f}, {g[1]:>6.2f})     ({m[0]:>6.4f}, {m[1]:>6.4f})     ({v[0]:>6.4f}, {v[1]:>6.4f})")

print("\n📈 Effective Learning Rates (Adam adapts per-dimension!):")
print(f"{'Step':<6} {'LR for x':<15} {'LR for y':<15}")
print("-" * 40)
for step in [1, 10, 50]:
    m = m_hist[step]
    v = v_hist[step]
    m_hat = m / (1 - 0.9**step)
    v_hat = v / (1 - 0.999**step)
    
    lr_x = 0.1 / (np.sqrt(v_hat[0]) + 1e-8)
    lr_y = 0.1 / (np.sqrt(v_hat[1]) + 1e-8)
    print(f"{step:<6} {lr_x:<15.6f} {lr_y:<15.6f}")

print("\n💡 Key Insights:")
print("   - Adam takes larger steps in x (flat direction)")
print("   - Adam takes smaller steps in y (steep direction)")
print("   - m provides momentum for consistent direction")
print("   - v provides per-dimension scaling")
print("   - Bias correction is crucial in early steps!")
Real-World Insight: Adam is the default choice in most PyTorch and TensorFlow projects. Understanding its components helps you debug training issues - if the loss oscillates, try a lower learning rate (or a lower β1); if it barely decreases, try a higher learning rate!

What’s Next?

You have mastered the core math of learning!
  1. Derivatives: How things change.
  2. Gradients: The direction of change.
  3. Chain Rule: How changes propagate.
  4. Gradient Descent: How to learn.
  5. Optimization: How to learn fast.
Now, you are ready to build something real.

Quick Reference: Optimizer Selection Guide

Bookmark this! Use it when starting a new project.
Scenario                          Recommended Optimizer   Settings
Default starting point            Adam                    lr=0.001
Computer Vision (CNN)             SGD + Momentum          lr=0.1, momentum=0.9, plus an LR scheduler
Transformers / LLMs               AdamW                   lr=1e-4 to 3e-4, weight_decay=0.01
Fine-tuning pre-trained models    AdamW                   lr=2e-5 to 5e-5
Small dataset                     Adam                    lr=0.001
Reinforcement Learning            Adam                    lr=3e-4
GANs                              Adam                    lr=0.0002, betas=(0.5, 0.999)
Sparse gradients (embeddings)     SparseAdam              lr=0.001
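In PyTorch, the most common rows of this table look like the following (a sketch assuming model is your nn.Module; the values are copied from the table, so tune them for your problem):

import torch.optim as optim

# Default starting point
opt = optim.Adam(model.parameters(), lr=1e-3)

# Computer Vision (CNN): SGD + momentum, usually paired with an LR scheduler
opt = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Transformers / fine-tuning: AdamW with decoupled weight decay
opt = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)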

Hyperparameter Tuning Priority

1. Learning Rate    ← Most important! Try [1e-4, 1e-3, 1e-2, 0.1]
2. Batch Size       ← 16-64 for small data, 256-1024 for large
3. Momentum/Beta1   ← Usually 0.9 is fine
4. Beta2            ← Usually 0.999 is fine
5. Weight Decay     ← Try [0, 1e-4, 1e-2] if overfitting
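Because the learning rate dominates, a quick sweep is usually the first experiment worth running. A sketch, assuming a hypothetical train_and_evaluate(lr) helper that trains briefly and returns a validation loss:

results = {}
for lr in [1e-4, 1e-3, 1e-2, 0.1]:
    results[lr] = train_and_evaluate(lr)  # hypothetical helper - your own training routine

best_lr = min(results, key=results.get)  # lowest validation loss wins
print(f"Best learning rate: {best_lr}")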

Interview Questions: Optimizers

Q: Why is Adam usually preferred over plain SGD?
Answer: Adam combines momentum (averages gradients) with adaptive learning rates (scales by gradient variance). This helps it:
  • Handle sparse gradients well
  • Converge faster with less tuning
  • Adapt to each parameter’s needs
However, SGD + momentum often generalizes better for computer vision - it’s just harder to tune!
Q: What is the difference between Adam and AdamW?
Answer: AdamW decouples weight decay from the gradient update. In vanilla Adam, weight decay is applied to the gradient, which interacts poorly with the adaptive learning rate. AdamW applies weight decay directly to the weights:
# Adam (problematic)
w -= lr * (gradient + weight_decay * w) / sqrt(v)

# AdamW (correct)  
w -= lr * gradient / sqrt(v)
w -= lr * weight_decay * w
Q: Your model's loss is not decreasing. How do you debug it?
Answer:
  1. First, verify the model can overfit a tiny subset (10 examples). If not, there’s a bug.
  2. Check for NaN/Inf in gradients and activations
  3. Try different learning rates (10x higher and 10x lower)
  4. Check data preprocessing (normalization, labels)
  5. Verify loss function is correct for the task
  6. Plot gradient norms over time (should be stable, not exploding/vanishing)
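For step 6, a minimal way to log the global gradient norm right after loss.backward() (standard PyTorch; how often you print it is up to you):

total_norm = 0.0
for p in model.parameters():
    if p.grad is not None:
        total_norm += p.grad.detach().norm(2).item() ** 2
total_norm = total_norm ** 0.5
print(f"grad norm: {total_norm:.4f}")  # watch for spikes (exploding) or collapse toward 0 (vanishing)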

Next Up: Final Project

Build a Neural Network from Scratch.