Optimization Techniques

Your Challenge: The Valley of Deceit

Standard Gradient Descent is like walking downhill blindfolded. It works great on a smooth, simple hill. But real-world loss landscapes are treacherous.
  • Local Minima: Small dips that look like the bottom but aren’t.
  • Saddle Points: Flat areas where you get stuck.
  • Ravines: Steep walls where you bounce back and forth.
Your Goal: Navigate this treacherous terrain to find the true global minimum, fast. You need better equipment than just “walking downhill”. You need Momentum and Adaptive Steps.

Momentum: The Heavy Ball

The Intuition

Imagine rolling a ping-pong ball down a bumpy hill. It gets stuck in every little pothole (Local Minimum). Now imagine rolling a heavy bowling ball.
  • It gains speed.
  • When it hits a small pothole, its momentum carries it right through.
  • It eventually settles in the deepest valley.
[Figure: Momentum ball analogy]

The Math

Instead of just following the gradient, we keep a “velocity” ($v$) that accumulates speed.

\begin{align}
v_{new} &= \beta \cdot v_{old} + (1 - \beta) \cdot \nabla f(x) \\
x_{new} &= x_{old} - \alpha \cdot v_{new}
\end{align}

  • $\beta$: Friction (usually 0.9). Retains 90% of speed.
  • $v$: Velocity.
  • $\alpha$: Learning rate (step size).

The Code

import numpy as np

# A function with a shallow local minimum at x=-2 and the global minimum at x=2.
# Its gradient factors as (x+2)(x+1)(x-2), so the barrier between them sits at x=-1.
def f(x): return 0.25*x**4 + x**3/3 - 2*x**2 - 4*x
def grad(x): return x**3 + x**2 - 4*x - 4

lr = 0.1

# 1. Standard SGD (Gets stuck)
x = -3.0
for _ in range(200):
    x = x - lr * grad(x)
print(f"SGD stuck at x={x:.2f}")  # ~ -2.0 (Local Min)

# 2. Momentum (Escapes!)
x = -3.0
v = 0.0
beta = 0.9
for _ in range(200):
    v = beta * v + (1 - beta) * grad(x)  # accumulate velocity
    x = x - lr * v                       # step along the velocity
print(f"Momentum reached x={x:.2f}") # ~ 2.0 (Global Min)
Key Insight: Momentum helps you blast through small traps and speed up on flat surfaces!

RMSprop & Adam: Adaptive Shoes

The Problem with Ravines

Imagine a narrow ravine.
  • Steep walls (High gradient in $y$ direction).
  • Gentle slope towards the sea (Low gradient in $x$ direction).
If you take big steps, you bounce off the walls ($y$) and never move forward ($x$). If you take small steps, you move forward ($x$) but it takes forever.

Solution: Wear Adaptive Shoes.
  • If the ground is steep ($y$), take tiny steps.
  • If the ground is flat ($x$), take huge steps.
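The "adaptive shoes" idea is essentially what RMSprop does: divide each coordinate's step by a running estimate of that coordinate's gradient size. Below is a minimal sketch, assuming a toy ravine f(x, y) = 0.5x² + 50y². The function, the helper names `ravine_grad` and `rmsprop_step`, and the hyperparameter values are illustrative assumptions, not part of the original lesson.

```python
import numpy as np

def ravine_grad(p):
    # Gradient of the assumed toy ravine f(x, y) = 0.5*x**2 + 50*y**2
    return np.array([p[0], 100.0 * p[1]])

def rmsprop_step(p, s, lr=0.01, decay=0.9, eps=1e-8):
    g = ravine_grad(p)
    s = decay * s + (1 - decay) * g**2     # running average of squared gradients
    p = p - lr * g / (np.sqrt(s) + eps)    # per-coordinate normalized step
    return p, s

start = np.array([10.0, 1.0])
grad_at_start = ravine_grad(start)         # y gradient is 10x the x gradient here

gd_step = 0.01 * grad_at_start             # plain gradient descent step
new_p, s = rmsprop_step(start, np.zeros(2))
rms_step = start - new_p                   # RMSprop step from the same point

print("gradient     :", grad_at_start)
print("GD step      :", gd_step)
print("RMSprop step :", rms_step)
```

Plain gradient descent moves ten times further along the steep y direction than along x, which is the source of the bouncing. The first RMSprop step has the same size in both coordinates because each one is divided by its own gradient scale.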

Adam (Adaptive Moment Estimation)

Adam combines both ideas:
  1. Momentum (first moment): Keep moving forward by tracking the exponential moving average of the gradient.
  2. RMSprop (second moment): Adapt step size by tracking the exponential moving average of the squared gradient.
It is the most common default optimizer in deep learning today, and for good reason: it works well out-of-the-box with minimal hyperparameter tuning. The default settings (lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8) are a solid starting point for most problems.

The Full Adam Update (Step by Step)

Here is what Adam does on every training step, in plain language:
  1. Compute the gradient $g$ for the current batch
  2. Update the momentum estimate: $m = 0.9 \times m_{old} + 0.1 \times g$ (moving average of gradients)
  3. Update the variance estimate: $v = 0.999 \times v_{old} + 0.001 \times g^2$ (moving average of squared gradients)
  4. Correct for initialization bias: $\hat{m} = m / (1 - 0.9^t)$, $\hat{v} = v / (1 - 0.999^t)$ (without this, early estimates are biased toward zero)
  5. Update the weight: $w = w - lr \times \hat{m} / (\sqrt{\hat{v}} + \epsilon)$
The division by $\sqrt{\hat{v}}$ is the key adaptive part: parameters with large gradients get smaller effective learning rates, and parameters with small gradients get larger effective learning rates. This is like giving each parameter its own personal learning rate, automatically tuned based on its recent gradient history.
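As a worked check of steps 2 to 5, consider the very first update ($t = 1$), assuming $m_{old} = v_{old} = 0$ (the usual initialization) and the default betas:

```latex
\begin{align}
m &= 0.1 \times g, & \hat{m} &= \frac{0.1 \times g}{1 - 0.9^{1}} = g \\
v &= 0.001 \times g^2, & \hat{v} &= \frac{0.001 \times g^2}{1 - 0.999^{1}} = g^2 \\
w_{new} &= w - lr \times \frac{g}{|g| + \epsilon} \approx w - lr \times \operatorname{sign}(g)
\end{align}
```

So the first step has a size close to $lr$ for every parameter, whether its gradient is 0.001 or 1000. Only the sign of the gradient carries over.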
Numerical Stability: The epsilon parameter

The eps=1e-8 in Adam is not just a theoretical safeguard: it prevents division by zero when a parameter has received very small gradients (so $\hat{v} \approx 0$). In practice, some practitioners increase epsilon to 1e-7 or even 1e-4 for certain models (especially transformers with mixed precision training, where float16 has lower precision). The original BERT training used eps=1e-6. If you see training instability with Adam, increasing epsilon is a quick and often effective fix.
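One concrete reason epsilon matters under reduced precision is that 1e-8 is below the smallest value float16 can represent, so it rounds to zero. The sketch below checks this with NumPy. Whether it affects a real training run depends on which dtype the optimizer state is kept in (many mixed-precision setups keep it in float32), so treat this as an illustration of the number format rather than of any particular framework.

```python
import numpy as np

# How each candidate epsilon survives conversion to float16
for eps in (1e-8, 1e-6, 1e-4):
    as_half = np.float16(eps)
    print(f"eps={eps:g} -> float16 value {float(as_half):g}")

# Denominator sqrt(v_hat) + eps when v_hat has underflowed to zero
v_hat = np.float16(0.0)
with np.errstate(divide="ignore", invalid="ignore"):
    unsafe = np.float16(1.0) / (np.sqrt(v_hat) + np.float16(1e-8))  # eps vanished
    safe = np.float16(1.0) / (np.sqrt(v_hat) + np.float16(1e-4))    # eps survives

print("with eps=1e-8:", unsafe)
print("with eps=1e-4:", safe)
```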

The Code (Using PyTorch)

You rarely implement Adam from scratch. You use a library.
import torch
import torch.nn as nn
import torch.optim as optim

# 1. Define your model
model = nn.Linear(10, 1)

# 2. Choose your optimizer
# SGD
optimizer_sgd = optim.SGD(model.parameters(), lr=0.01)

# Momentum
optimizer_mom = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

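# Adam (a common default)
optimizer_adam = optim.Adam(model.parameters(), lr=0.001)

# 3. Training loop (one step shown, on random placeholder data)
inputs, targets = torch.randn(32, 10), torch.randn(32, 1)
optimizer = optimizer_adam
optimizer.zero_grad()                                  # clear old gradients
loss = nn.functional.mse_loss(model(inputs), targets)  # forward pass
loss.backward()                                        # compute gradients
optimizer.step()                                       # This applies the math!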

Comparison: Who Wins?

| Optimizer | Analogy | Best For | When to Avoid |
|---|---|---|---|
| SGD | A cautious hiker with a compass | Convex problems, fine-tuning (with momentum) | Complex loss landscapes, initial training |
| Momentum | A bowling ball rolling downhill | Noisy gradients, escaping local minima | When you need per-parameter adaptation |
| Adam | An experienced guide with adaptive gear | Almost everything (default choice) | Sometimes worse for very well-tuned vision tasks |
When Adam Is Not the Best Choice

Despite being the default, Adam does not always win. For image classification tasks (ResNets on ImageNet), well-tuned SGD with momentum often produces slightly better final accuracy. The hypothesis is that Adam’s adaptive learning rates can find sharp minima that generalize slightly worse. In practice, most teams start with Adam for fast iteration, then consider switching to SGD with momentum for the final “squeeze the last 0.5% accuracy” phase.

For language models and transformers, Adam (or its variant AdamW, which decouples weight decay) is almost universally preferred. AdamW is the optimizer behind GPT, BERT, and most modern large language models.

A senior engineer’s mental model: “Adam for exploration, SGD for exploitation.”

Visual Comparison

[Figure: Optimizer comparison]

If we race them on a complex terrain:
  1. SGD: Stumbles, gets stuck.
  2. Momentum: Overshoots but eventually settles.
  3. Adam: Beelines straight for the goal.

Practice Exercise: Escape the Trap

The Scenario

You are training a model that keeps getting stuck at 80% accuracy.
  • Loss isn’t going down.
  • Gradients are small but not zero.
Diagnosis: You are likely in a Saddle Point or Local Minimum.

Your Task: Switch from SGD to Adam and observe the difference.
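# Pseudo-code for your experiment (MyNeuralNet and train are placeholders)
criterion = nn.MSELoss()

# Experiment A: SGD
model_a = MyNeuralNet()
opt_a = optim.SGD(model_a.parameters(), lr=0.01)
train(model_a, opt_a)  # Illustrative result: 80% acc

# Experiment B: Adam (fresh model, so both runs start from scratch)
model_b = MyNeuralNet()
opt_b = optim.Adam(model_b.parameters(), lr=0.001)
train(model_b, opt_b)  # Illustrative result: 95% acc!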
Takeaway: Changing the optimizer is often the easiest way to improve your model!

🎯 Practice Exercises & Real-World Applications

Challenge yourself! These exercises let you experience why modern optimizers matter.

Exercise 1: Optimizer Shootout 🏁

Race different optimizers on the same problem:
import numpy as np

# Rosenbrock function - a classic optimization challenge
# f(x, y) = (1-x)² + 100(y-x²)²
# Minimum at (1, 1)

# Starting point: (-2, 2)
# This is a "banana-shaped" valley - hard for basic GD!

# TODO:
# 1. Implement SGD, Momentum, and Adam
# 2. Run each for 5000 steps
# 3. Compare: distance to optimum, path length, convergence speed
Solution:
import numpy as np

def rosenbrock(x, y):
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(x, y):
    dx = -2*(1-x) - 400*x*(y - x**2)
    dy = 200*(y - x**2)
    return np.array([dx, dy])

def run_sgd(start, lr=0.001, steps=5000):
    pos = np.array(start, dtype=float)
    history = [pos.copy()]
    for _ in range(steps):
        grad = rosenbrock_grad(*pos)
        pos -= lr * grad
        history.append(pos.copy())
    return history

def run_momentum(start, lr=0.001, beta=0.9, steps=5000):
    pos = np.array(start, dtype=float)
    velocity = np.zeros(2)
    history = [pos.copy()]
    for _ in range(steps):
        grad = rosenbrock_grad(*pos)
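        # Classical (PyTorch-style) momentum: no (1 - beta) factor, so steps are
        # roughly 1 / (1 - beta) times larger than the averaged form in "The Math"
        velocity = beta * velocity + grad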
        pos -= lr * velocity
        history.append(pos.copy())
    return history

def run_adam(start, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    pos = np.array(start, dtype=float)
    m = np.zeros(2)
    v = np.zeros(2)
    history = [pos.copy()]
    for t in range(1, steps + 1):
        grad = rosenbrock_grad(*pos)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        pos -= lr * m_hat / (np.sqrt(v_hat) + eps)
        history.append(pos.copy())
    return history

print("🏁 Optimizer Shootout: Rosenbrock Function")
print("=" * 55)
print("Target: (1, 1) | Start: (-2, 2)")

start = [-2.0, 2.0]
optimal = np.array([1.0, 1.0])

histories = {
    "SGD": run_sgd(start),
    "Momentum": run_momentum(start),
    "Adam": run_adam(start)
}

print("\n📊 Results after 5000 steps:")
print("-" * 55)
print(f"{'Optimizer':<12} {'Final Position':<20} {'Distance':<10} {'Final Loss'}")
print("-" * 55)

for name, hist in histories.items():
    final = hist[-1]
    dist = np.linalg.norm(final - optimal)
    loss = rosenbrock(*final)
    print(f"{name:<12} ({final[0]:>7.4f}, {final[1]:>7.4f})   {dist:<10.6f} {loss:.6f}")

# Convergence speed analysis
print("\n⏱️ Steps to reach distance < 0.1 from optimum:")
for name, hist in histories.items():
    for i, pos in enumerate(hist):
        if np.linalg.norm(pos - optimal) < 0.1:
            print(f"   {name}: {i} steps")
            break
    else:
        print(f"   {name}: Never reached (final dist: {np.linalg.norm(hist[-1] - optimal):.4f})")

# Path analysis
print("\n📈 Path Length (total distance traveled):")
for name, hist in histories.items():
    # Sum the length of every step in the trajectory
    path_length = sum(np.linalg.norm(hist[i+1] - hist[i])
                      for i in range(len(hist) - 1))
    print(f"   {name}: {path_length:.2f}")

print("\n💡 What to look for:")
print("   - SGD tends to crawl along the curved valley floor")
print("   - Momentum moves faster but can overshoot")
print("   - Adam adapts step sizes per coordinate; check the table above to see how close it gets")
Real-World Insight: The Rosenbrock function is a classic benchmark. Real neural network loss landscapes are even more complex, which is one reason Adam is such a common default choice!

Exercise 2: Learning Rate Scheduling 📅

Implement and compare learning rate schedules:
import numpy as np

def quadratic(x):
    return (x - 5)**2

def grad(x):
    return 2*(x - 5)

# TODO:
# 1. Implement constant LR (lr = 0.1)
# 2. Implement step decay (halve every 100 steps)
# 3. Implement exponential decay (lr = lr0 * 0.99^step)
# 4. Implement cosine annealing
# 5. Compare convergence and stability
Solution:
import numpy as np

def f(x):
    return (x - 5)**2

def grad(x):
    return 2*(x - 5)

def train_with_schedule(schedule_fn, steps=500, x0=0):
    x = x0
    history = [(x, f(x))]
    for step in range(steps):
        lr = schedule_fn(step)
        x = x - lr * grad(x)
        history.append((x, f(x)))
    return history

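# Learning rate schedules
# Note: on this quadratic, lr=0.5 lands exactly on the minimum in one step
# (x - 0.5 * 2 * (x - 5) = 5), so schedules that start at lr0=0.5 converge at step 1.
# Lower lr0 (e.g. 0.1) for a like-for-like comparison with the constant schedule.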
def constant(step, lr0=0.1):
    return lr0

def step_decay(step, lr0=0.5, decay_rate=0.5, decay_steps=100):
    return lr0 * (decay_rate ** (step // decay_steps))

def exponential(step, lr0=0.5, decay=0.99):
    return lr0 * (decay ** step)

def cosine_annealing(step, lr0=0.5, T=500):
    return lr0 * 0.5 * (1 + np.cos(np.pi * step / T))

def warmup_cosine(step, lr0=0.5, warmup_steps=50, T=500):
    if step < warmup_steps:
        return lr0 * step / warmup_steps
    return lr0 * 0.5 * (1 + np.cos(np.pi * (step - warmup_steps) / (T - warmup_steps)))

print("📅 Learning Rate Schedule Comparison")
print("=" * 55)

schedules = {
    "Constant (0.1)": lambda s: constant(s),
    "Step Decay": lambda s: step_decay(s),
    "Exponential": lambda s: exponential(s),
    "Cosine Annealing": lambda s: cosine_annealing(s),
    "Warmup + Cosine": lambda s: warmup_cosine(s),
}

print("\n📊 Training Results (500 steps, target x=5):")
print("-" * 55)
print(f"{'Schedule':<20} {'Final x':<12} {'Final Loss':<12} {'Converged at'}")
print("-" * 55)

for name, sched in schedules.items():
    hist = train_with_schedule(sched)
    final_x, final_loss = hist[-1]
    
    # Find when it first got close
    converge_step = None
    for i, (x, loss) in enumerate(hist):
        if abs(x - 5) < 0.01:
            converge_step = i
            break
    
    conv_str = f"step {converge_step}" if converge_step is not None else "Never"
    print(f"{name:<20} {final_x:<12.6f} {final_loss:<12.8f} {conv_str}")

# Learning rate visualization
print("\n📈 Learning Rate Over Time:")
print("   Step | Constant | StepDecay | Exponent | Cosine  ")
print("   -----|----------|-----------|----------|--------")
for step in [0, 50, 100, 200, 300, 400, 499]:
    c = constant(step)
    s = step_decay(step)
    e = exponential(step)
    cos = cosine_annealing(step)
    print(f"   {step:4} | {c:8.4f} | {s:9.4f} | {e:8.4f} | {cos:8.4f}")

print("\n💡 Key Insights:")
print("   - Constant LR: Simple but may oscillate near optimum")
print("   - Step decay: Sudden drops can cause instability")
print("   - Exponential: Smooth decay, widely used")
print("   - Cosine: Modern favorite, smooth and goes to zero")
print("   - Warmup: Helps with unstable early gradients")
Real-World Insight: BERT, GPT, and most modern transformers use warmup + cosine (or linear) decay. The warmup phase is crucial for training stability with adaptive optimizers!

Exercise 3: Batch Size Trade-offs ⚖️

Explore the relationship between batch size and training:
import numpy as np

# Generate regression data
np.random.seed(42)
n = 10000
X = np.random.randn(n, 5)
true_w = np.array([1, -2, 0.5, 3, -1])
y = X @ true_w + np.random.randn(n) * 0.5

# TODO:
# 1. Train with batch sizes: 1, 16, 64, 256, full
# 2. For each: measure steps to convergence, final accuracy, variance in updates
# 3. Plot the gradient variance for different batch sizes
# 4. Find the "sweet spot" batch size for this problem
Solution:
import numpy as np

np.random.seed(42)
n = 10000
X = np.random.randn(n, 5)
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w + np.random.randn(n) * 0.5

def loss(w):
    return np.mean((X @ w - y) ** 2)

def gradient_full(w):
    return 2 * X.T @ (X @ w - y) / n

def gradient_batch(w, batch_idx):
    X_b = X[batch_idx]
    y_b = y[batch_idx]
    return 2 * X_b.T @ (X_b @ w - y_b) / len(batch_idx)

def train(batch_size, lr=0.01, max_epochs=50):
    w = np.zeros(5)
    losses = []
    grad_variances = []
    
    for epoch in range(max_epochs):
        if batch_size >= n:
            grad = gradient_full(w)
            w = w - lr * grad
            grad_variances.append(0)  # No variance in full batch
        else:
            indices = np.random.permutation(n)
            epoch_grads = []
            for i in range(0, n, batch_size):
                batch_idx = indices[i:i+batch_size]
                grad = gradient_batch(w, batch_idx)
                epoch_grads.append(grad)
                w = w - lr * grad
            grad_variances.append(np.var(epoch_grads))
        
        losses.append(loss(w))
        if losses[-1] < 0.3:  # Converged
            break
    
    return w, losses, grad_variances

print("⚖️ Batch Size Trade-offs")
print("=" * 55)

batch_sizes = [1, 16, 64, 256, 1024, n]
batch_names = ["1 (SGD)", "16", "64", "256", "1024", "Full"]

print("\n📊 Training Results (max 50 epochs):")
print("-" * 65)
print(f"{'Batch Size':<12} {'Epochs':<8} {'Final Loss':<12} {'Weight Error':<15} {'Avg Grad Var'}")
print("-" * 65)

for bs, name in zip(batch_sizes, batch_names):
    lr = 0.01 if bs < 256 else 0.1  # Larger LR for larger batches
    w, losses, gv = train(bs, lr=lr)
    epochs = len(losses)
    final_loss = losses[-1]
    w_error = np.linalg.norm(w - true_w)
    avg_var = np.mean(gv) if gv and gv[0] > 0 else 0
    
    print(f"{name:<12} {epochs:<8} {final_loss:<12.6f} {w_error:<15.6f} {avg_var:.6f}")

# Gradient variance analysis
print("\n📈 Gradient Variance vs Batch Size:")
print("   (Lower variance = more stable updates)")
for bs, name in zip([1, 16, 64, 256], ["1", "16", "64", "256"]):
    w = np.zeros(5)
    grad_norms = []
    for _ in range(100):
        batch_idx = np.random.choice(n, bs, replace=False)
        grad = gradient_batch(w, batch_idx)
        grad_norms.append(np.linalg.norm(grad))
    
    variance = np.var(grad_norms)  # spread of the gradient norm across random batches
    bar = "█" * int(variance * 5)
    print(f"   BS={name:>4}: {bar} {variance:.4f}")

print("\n💡 Key Insights:")
print("   - Small batches: High variance but can escape local minima")
print("   - Large batches: Low variance but may need higher LR")
print("   - Sweet spot (16-64): Balance of speed and stability")
print("   - GPUs favor powers of 2 (32, 64, 128, 256)")
Real-World Insight: Large-batch training research (for example, the LARS and LAMB optimizers) showed that batch sizes of 32K and above can work with proper learning rate scaling. This lets training be spread across many accelerators, cutting wall-clock time substantially!
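One widely used form of "proper learning rate scaling" is the linear scaling rule combined with a short warmup. The lesson does not say which rule it has in mind, so the sketch below is an assumption, and the helper names `scaled_lr` and `warmup_lr` and the base values are illustrative only.

```python
def scaled_lr(base_lr, base_batch_size, batch_size):
    """Linear scaling rule: grow the learning rate in proportion to batch size."""
    return base_lr * batch_size / base_batch_size

def warmup_lr(target_lr, step, warmup_steps):
    """Ramp linearly up to target_lr over warmup_steps, then hold it."""
    if step >= warmup_steps:
        return target_lr
    return target_lr * (step + 1) / warmup_steps

base_lr, base_bs = 0.1, 256   # assumed reference configuration
for bs in (256, 1024, 8192):
    print(f"batch size {bs:>5}: lr = {scaled_lr(base_lr, base_bs, bs):.2f}")

target = scaled_lr(base_lr, base_bs, 1024)
print([round(warmup_lr(target, s, warmup_steps=5), 2) for s in range(7)])
```

The warmup matters because the scaled learning rate is often too aggressive for the first few updates, when the weights are still far from any reasonable solution.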

Exercise 4: Adam from Scratch 🔧

Implement the Adam optimizer and understand each component:
import numpy as np

# Implement Adam optimizer for a 2D problem
def f(x, y):
    return 0.5*x**2 + 5*y**2  # Elliptical bowl

def grad(x, y):
    return np.array([x, 10*y])

# TODO:
# 1. Implement Adam with m (momentum), v (RMSprop), and bias correction
# 2. Compare Adam to vanilla SGD on this problem
# 3. Visualize how m and v evolve during training
# 4. Show why Adam handles the different curvatures in x and y
Solution:
import numpy as np

def f(x, y):
    return 0.5*x**2 + 5*y**2

def grad(x, y):
    return np.array([x, 10*y])

def sgd(start, lr=0.1, steps=100):
    pos = np.array(start, dtype=float)
    history = [pos.copy()]
    for _ in range(steps):
        g = grad(*pos)
        pos -= lr * g
        history.append(pos.copy())
    return history

def adam(start, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    pos = np.array(start, dtype=float)
    m = np.zeros(2)  # First moment (momentum)
    v = np.zeros(2)  # Second moment (RMSprop)
    
    history = [pos.copy()]
    m_history = [m.copy()]
    v_history = [v.copy()]
    
    for t in range(1, steps + 1):
        g = grad(*pos)
        
        # Update moments
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        
        # Bias correction
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        
        # Update position
        pos -= lr * m_hat / (np.sqrt(v_hat) + eps)
        
        history.append(pos.copy())
        m_history.append(m.copy())
        v_history.append(v.copy())
    
    return history, m_history, v_history

print("🔧 Adam from Scratch")
print("=" * 55)
print("Function: f(x,y) = 0.5x² + 5y² (ellipse, steep in y)")
print("Start: (10, 1)")

start = [10.0, 1.0]

sgd_hist = sgd(start)
adam_hist, m_hist, v_hist = adam(start)

print("\n📊 Comparison (100 steps):")
print("-" * 55)
print(f"{'Step':<6} {'SGD Position':<25} {'Adam Position'}")
print("-" * 55)

for step in [0, 10, 25, 50, 100]:
    sgd_pos = sgd_hist[step]
    adam_pos = adam_hist[step]
    print(f"{step:<6} ({sgd_pos[0]:>8.4f}, {sgd_pos[1]:>8.4f})      ({adam_pos[0]:>8.4f}, {adam_pos[1]:>8.4f})")

print("\n🔬 Adam's Internal State Evolution:")
print(f"{'Step':<6} {'Gradient':<20} {'m (momentum)':<20} {'v (variance)'}")
print("-" * 70)
for step in [1, 5, 10, 25]:
    g = grad(*adam_hist[step-1])
    m = m_hist[step]
    v = v_hist[step]
    print(f"{step:<6} ({g[0]:>6.2f}, {g[1]:>6.2f})     ({m[0]:>6.4f}, {m[1]:>6.4f})     ({v[0]:>6.4f}, {v[1]:>6.4f})")

print("\n📈 Effective Learning Rates (Adam adapts per-dimension!):")
print(f"{'Step':<6} {'LR for x':<15} {'LR for y':<15}")
print("-" * 40)
for step in [1, 10, 50]:
    m = m_hist[step]
    v = v_hist[step]
    m_hat = m / (1 - 0.9**step)
    v_hat = v / (1 - 0.999**step)
    
    lr_x = 0.1 / (np.sqrt(v_hat[0]) + 1e-8)
    lr_y = 0.1 / (np.sqrt(v_hat[1]) + 1e-8)
    print(f"{step:<6} {lr_x:<15.6f} {lr_y:<15.6f}")

print("\n💡 Key Insights:")
print("   - Adam divides each coordinate's step by sqrt(v), its own gradient scale")
print("   - Steep (y) and flat (x) directions therefore get comparable step sizes")
print("   - m provides momentum for consistent direction")
print("   - v provides per-dimension scaling")
print("   - Bias correction is crucial in early steps!")
Real-World Insight: Adam ships with both PyTorch and TensorFlow and is a common first choice. Understanding its components helps you debug training issues: if the loss is oscillating, try a lower learning rate (or a lower β1); if it is barely decreasing, try a higher learning rate!

What’s Next?

You have mastered the core math of learning!
  1. Derivatives: How things change.
  2. Gradients: The direction of change.
  3. Chain Rule: How changes propagate.
  4. Gradient Descent: How to learn.
  5. Optimization: How to learn fast.
Now, you are ready to build something real.

Quick Reference: Optimizer Selection Guide

Bookmark this! Use it when starting a new project.
| Scenario | Recommended Optimizer | Settings |
|---|---|---|
| Default starting point | Adam | lr=0.001 |
| Computer Vision (CNN) | SGD + Momentum | lr=0.1, momentum=0.9 + scheduler |
| Transformers/LLMs | AdamW | lr=1e-4 to 3e-4, weight_decay=0.01 |
| Fine-tuning pre-trained | AdamW | lr=2e-5 to 5e-5 |
| Small dataset | Adam | lr=0.001 |
| Reinforcement Learning | Adam | lr=3e-4 |
| GANs | Adam | lr=0.0002, betas=(0.5, 0.999) |
| Sparse gradients (embeddings) | SparseAdam | lr=0.001 |

Hyperparameter Tuning Priority

1. Learning Rate    ← Most important! Try [1e-4, 1e-3, 1e-2, 0.1]
2. Batch Size       ← 16-64 for small data, 256-1024 for large
3. Momentum/Beta1   ← Usually 0.9 is fine
4. Beta2            ← Usually 0.999 is fine
5. Weight Decay     ← Try [0, 1e-4, 1e-2] if overfitting
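A learning rate sweep on a log scale can be sketched in a few lines. The example below assumes a toy objective (x - 5)² and plain gradient descent, purely for illustration. In a real project `final_loss` would be a short training run that returns a validation metric.

```python
def final_loss(lr, steps=50, x0=0.0):
    # Toy objective (x - 5)^2, minimized by plain gradient descent
    x = x0
    for _ in range(steps):
        x -= lr * 2 * (x - 5)
    return (x - 5) ** 2

candidates = [1e-4, 1e-3, 1e-2, 0.1]           # log-spaced, as suggested above
results = {lr: final_loss(lr) for lr in candidates}
best_lr = min(results, key=results.get)

for lr, loss in results.items():
    print(f"lr={lr:g}  final loss={loss:.6f}")
print("best learning rate:", best_lr)
```

On this toy problem the largest candidate wins because every candidate is still within the stable range. With a real model, the best value is usually somewhat below the point where training starts to diverge.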

Interview Questions: Optimizers

Answer: Adam combines momentum (averages gradients) with adaptive learning rates (scales by gradient variance). This helps it:
  • Handle sparse gradients well
  • Converge faster with less tuning
  • Adapt to each parameter’s needs
However, SGD + momentum often generalizes better for computer vision - it’s just harder to tune!
Answer: AdamW decouples weight decay from the gradient update. In vanilla Adam, weight decay is applied to the gradient, which interacts poorly with the adaptive learning rate. AdamW applies weight decay directly to the weights:
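# Adam with L2 weight decay (problematic, simplified):
# the decay term is folded into the gradient and then rescaled by sqrt(v)
w -= lr * (gradient + weight_decay * w) / (sqrt(v) + eps)

# AdamW (correct, simplified):
# the decay is applied to the weights directly, outside the adaptive scaling
w -= lr * gradient / (sqrt(v) + eps)
w -= lr * weight_decay * w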
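The difference can be checked numerically. The sketch below is a minimal NumPy illustration, not any library's implementation: `adam_step` is a hypothetical helper, the weights and gradients are made up, and only the first step (t=1) is shown. It compares plain Adam, Adam with the decay folded into the gradient, and decoupled decay.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update; returns the new weights and moment estimates
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w = np.array([1.0, 1.0])      # two weights with the same value...
g = np.array([0.01, 10.0])    # ...but very different gradient scales
zeros = np.zeros(2)
lr, wd = 1e-3, 0.1

plain, _, _ = adam_step(w, g, zeros, zeros, t=1, lr=lr)          # no decay at all
l2, _, _ = adam_step(w, g + wd * w, zeros, zeros, t=1, lr=lr)    # decay inside the gradient
adamw = plain - lr * wd * w                                      # decoupled decay

print("extra shrink from L2-in-gradient:", plain - l2)
print("extra shrink from decoupled decay:", plain - adamw)
```

On this first step the L2 term is almost entirely absorbed by the adaptive normalization, so it shrinks the weights by practically nothing. The decoupled version shrinks both weights by exactly lr * weight_decay * w regardless of gradient scale. On later steps the L2 term does have an effect, but it remains rescaled per parameter by sqrt(v).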
Answer:
  1. First, verify the model can overfit a tiny subset (10 examples). If not, there’s a bug.
  2. Check for NaN/Inf in gradients and activations
  3. Try different learning rates (10x higher and 10x lower)
  4. Check data preprocessing (normalization, labels)
  5. Verify loss function is correct for the task
  6. Plot gradient norms over time (should be stable, not exploding/vanishing)

Final Project

Build a Neural Network from Scratch

Interview Deep-Dive

Strong Answer:
  • Adam adapts the learning rate per parameter using running averages of first and second gradient moments. This makes it excellent at handling sparse gradients, noisy objectives, and landscapes with wildly different curvatures per parameter. It converges quickly and requires minimal tuning — which is why it is the default.
  • However, there is strong empirical evidence (Wilson et al., 2017) that SGD with momentum generalizes better than Adam for computer vision tasks. The leading theory is that Adam’s adaptive rates effectively give each parameter its own loss landscape, and these individual trajectories can converge to sharper minima compared to SGD’s “one learning rate for all” approach. SGD’s uniform step size forces all parameters through the same optimization dynamics, which acts as an implicit regularizer.
  • My decision framework for a new project: Start with Adam (lr=1e-3) to get a quick baseline — it will converge fast and tell you if the architecture and data pipeline are working. Then, if the task is well-established (like ImageNet classification), switch to SGD with momentum for the final model because the generalization benefit is worth the extra tuning effort. For transformers and NLP, use AdamW (Adam with decoupled weight decay) because the adaptive rates are essential for the highly heterogeneous gradient landscape of attention mechanisms. For fine-tuning pretrained models, always Adam or AdamW at a low learning rate (1e-5 to 5e-5).
  • A nuance most people miss: Adam and AdamW are different in a meaningful way. Vanilla Adam applies weight decay to the gradient before the adaptive scaling, which means the actual regularization strength depends on the gradient magnitude — parameters with large gradients get less effective regularization. AdamW applies weight decay directly to the weights, decoupled from the gradient computation. Loshchilov and Hutter (2019) showed this decoupling is critical for proper regularization in transformers.
Follow-up: You mentioned that Adam’s adaptive rates can lead to sharper minima. Is there a way to get the convergence speed of Adam with the generalization of SGD?

Several recent optimizers attempt exactly this. SAM (Sharpness-Aware Minimization) takes a different approach: at each step, it first perturbs the weights in the direction that increases the loss the most within a small radius, then computes the gradient at that perturbed point and applies it to the original weights. This explicitly optimizes for flat minima regardless of the base optimizer. In practice, SAM with SGD or Adam has been reported to improve accuracy on ImageNet and CIFAR by roughly 0.5-1.5%, but it doubles the compute cost (two forward-backward passes per step). Another approach is AdaBound, which starts with Adam’s adaptive behavior and gradually transitions to SGD-like behavior by tightening bounds on the per-parameter learning rates as training progresses. RAdam (Rectified Adam) is related but targets the start of training: it holds back the adaptive term during the first steps, while the second-moment estimate is still unreliable. The field is still actively evolving, but a recurring pattern is to combine adaptive early-training dynamics with SGD-like late-training dynamics.
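The two-pass structure of SAM can be sketched as follows. This is a simplified illustration on an assumed toy loss 0.5 * ||w||², with a hypothetical `sam_step` helper. The radius `rho=0.5` is chosen to make the numbers easy to follow and is much larger than values typically used in practice.

```python
import numpy as np

def loss_grad(w):
    # Gradient of the assumed toy loss 0.5 * ||w||^2
    return w

def sam_step(w, lr=0.1, rho=0.5):
    g = loss_grad(w)                                    # first gradient pass
    w_adv = w + rho * g / (np.linalg.norm(g) + 1e-12)   # climb to the nearby worst case
    g_sam = loss_grad(w_adv)                            # second gradient pass
    return w - lr * g_sam                               # update the ORIGINAL weights

w = np.array([3.0, 4.0])
w_next = sam_step(w)
print(w_next)  # [2.67 3.56]
```

The two calls to `loss_grad` are where the doubled compute cost comes from: each one corresponds to a full forward-backward pass in a real network.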
Strong Answer:
  • Adam maintains two running averages per parameter. The first moment estimate m tracks the exponential moving average of gradients: m_t = beta1 * m_(t-1) + (1 - beta1) * g_t, where beta1 is typically 0.9. This is momentum — it smooths out gradient noise and builds up velocity in consistent directions.
  • The second moment estimate v tracks the exponential moving average of squared gradients: v_t = beta2 * v_(t-1) + (1 - beta2) * g_t^2, where beta2 is typically 0.999. This captures per-parameter gradient variance — how much the gradient fluctuates for each parameter.
  • Bias correction is critical in early steps. Since m and v are initialized to zero, they are biased toward zero for the first several iterations. The correction divides by (1 - beta^t): m_hat = m_t / (1 - beta1^t), v_hat = v_t / (1 - beta2^t). Without this, v is underestimated far more than m (because beta2 is closer to 1), so the ratio m / sqrt(v) is inflated and the first few updates are larger than intended: about 3x too large at step 1 with the default betas.
  • The parameter update is: theta_t = theta_(t-1) - lr * m_hat / (sqrt(v_hat) + epsilon). The division by sqrt(v_hat) is the adaptive learning rate — parameters with large historical gradients get smaller steps, and vice versa. Epsilon (typically 1e-8) prevents division by zero.
  • If you remove momentum (m): you get RMSprop. It still adapts per-parameter but lacks the smoothing effect. This makes it more sensitive to noisy gradients and more likely to oscillate.
  • If you remove the adaptive rate (v): you get SGD with momentum. You lose per-parameter adaptation, so you need a single global learning rate that works for all parameters — harder to tune.
  • If you remove bias correction: early training steps use the heavily biased m and v estimates. Because sqrt(v) in the denominator shrinks more than m does, the initial updates come out too large rather than too small. For short training runs, or with beta2 very close to 1, this can destabilize the start of training.
  • If you change epsilon from 1e-8 to something larger (like 1e-3): you reduce the adaptive effect. Parameters with small gradients no longer get dramatically larger effective learning rates. This can be useful when the adaptive behavior is too aggressive and causes training instability.
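The effect of bias correction on the very first update can be computed directly. The sketch below assumes m and v start at zero and uses the default betas. `first_update` is a hypothetical helper written for this check.

```python
import math

def first_update(g, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, bias_correction=True):
    # Size of Adam's update at t = 1, starting from m = v = 0
    m = (1 - beta1) * g
    v = (1 - beta2) * g * g
    if bias_correction:
        m = m / (1 - beta1)   # 1 - beta1**t with t = 1
        v = v / (1 - beta2)   # 1 - beta2**t with t = 1
    return lr * m / (math.sqrt(v) + eps)

with_bc = first_update(1.0)
without_bc = first_update(1.0, bias_correction=False)

print(f"with correction:    {with_bc:.6f}")
print(f"without correction: {without_bc:.6f}")
print(f"ratio:              {without_bc / with_bc:.3f}")
```

With correction the step is almost exactly lr. Without it the step is larger by a factor of (1 - beta1) / sqrt(1 - beta2), which is about 3.16 for the default betas.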
Follow-up: The beta2 parameter is set to 0.999, meaning the second moment has a very long memory. Why is this different from beta1=0.9, and what goes wrong if you set beta2=0.9?

The second moment v needs a longer memory because it estimates the typical gradient magnitude, which changes slowly relative to the gradient direction. The gradient direction (captured by m) can flip quickly: you might be going left one step and right the next. So beta1=0.9 gives a rolling average over roughly the last 10 steps, which is responsive enough to track direction changes. But the gradient magnitude (captured by v) reflects the underlying curvature of the loss surface, which changes much more slowly. beta2=0.999 averages over roughly the last 1000 steps, giving a stable estimate. If you set beta2=0.9, the v estimate becomes noisy and reactive: a single large gradient spike enters v with weight 0.1 instead of 0.001, so the effective learning rate for that parameter collapses for the next several steps and then rebounds as the spike is forgotten. This makes training erratic and can temporarily “freeze” parameters after outlier gradients. I have seen beta2=0.9 cause training instability in transformer models where gradient magnitudes can vary by orders of magnitude across steps due to attention pattern changes.
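The "roughly 10 steps" and "roughly 1000 steps" figures come from the 1 / (1 - beta) rule of thumb for an exponential moving average. The small sketch below uses a hypothetical helper `steps_to_half` to count how long an average that starts at zero takes to get halfway to a new constant signal.

```python
def steps_to_half(beta):
    # Feed a constant signal of 1.0 into an EMA that starts at 0
    # and count the steps until the average reaches 0.5.
    avg, steps = 0.0, 0
    while avg < 0.5:
        avg = beta * avg + (1 - beta) * 1.0
        steps += 1
    return steps

for beta in (0.9, 0.99, 0.999):
    window = 1 / (1 - beta)
    print(f"beta={beta}: ~{window:.0f}-step window, halfway after {steps_to_half(beta)} steps")
```

With beta=0.9 the average is halfway there after 7 steps, while beta=0.999 needs 693 steps. That gap is the difference in memory between the two moment estimates.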
Strong Answer:
  • A saddle point is a critical point (gradient = 0) where the surface curves upward in some directions and downward in others. Think of a mountain pass: if you stand at the top of the pass, the terrain goes up toward the peaks on either side but down into the valleys on either end.
  • In high dimensions, saddle points vastly outnumber local minima. The intuition: at a critical point, each eigenvalue of the Hessian is independently likely to be positive or negative. For a random critical point in N dimensions, the probability that ALL N eigenvalues are positive (true minimum) is about 2^(-N). For N = 1000, that is astronomically unlikely. Almost all critical points are saddle points.
  • Why they are problematic: the gradient is zero (or very small near the saddle), so gradient-based optimizers stall. The optimizer does not know which direction to move because the landscape looks flat locally. Unlike a local minimum where you are at least at a low point, a saddle point might have you at a high-loss region with a clear escape route — but the gradient cannot “see” it because the first-order information is zero.
  • How modern optimizers handle them: SGD with mini-batch noise naturally perturbs the optimizer away from exact saddle points. The stochastic gradient is almost never exactly zero even at a saddle. Momentum accumulates velocity from past gradients, so even if the current gradient is small, the optimizer keeps moving based on its history. Adam further helps because its adaptive rates can amplify steps in flat directions (where v is small, the effective learning rate is large), which is often exactly the escape direction from a saddle.
  • More sophisticated approaches include negative curvature exploitation: if you can identify directions where the Hessian has negative eigenvalues (the “downhill” directions of the saddle), you move along those directions. Cubic regularization methods and trust-region methods do this systematically. In practice, the combination of SGD noise + momentum + adaptive rates handles saddle points well enough for most deep learning tasks.
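The stalling and the role of noise can be seen on the simplest possible saddle. The sketch below assumes the toy function f(x, y) = x² - y² and adds Gaussian noise to the gradient as a stand-in for mini-batch sampling. The helper names and the noise level are illustrative.

```python
import numpy as np

def saddle_grad(p):
    # f(x, y) = x**2 - y**2 has a saddle at the origin:
    # curvature +2 along x (attracting), -2 along y (repelling)
    return np.array([2.0 * p[0], -2.0 * p[1]])

def descend(start, lr=0.1, steps=100, noise=0.0, seed=0):
    rng = np.random.default_rng(seed)
    p = np.array(start, dtype=float)
    for _ in range(steps):
        g = saddle_grad(p) + noise * rng.standard_normal(2)  # noisy gradient estimate
        p = p - lr * g
    return p

stuck = descend([1.0, 0.0])                  # exact gradients, start on the attracting axis
escaped = descend([1.0, 0.0], noise=1e-3)    # same start, tiny gradient noise

print("no noise  :", stuck)    # slides into the saddle and stays there
print("with noise:", escaped)  # the y coordinate has run away from the saddle
```

Without noise the iterate converges to the saddle because its y coordinate is exactly zero and never changes. With even tiny noise, the repelling direction amplifies the perturbation by a factor of 1.2 per step and the iterate leaves. (This toy function is unbounded below, so "escaping" here just means moving away along y.)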
Follow-up: If gradient descent can get stuck at saddle points, can it also get stuck at local maxima? Why or why not?

In theory, gradient descent could converge to a local maximum (all Hessian eigenvalues negative), but this is exponentially unlikely in practice. A local maximum is an unstable equilibrium: any tiny perturbation pushes you away from it, like balancing a ball on top of a hill. Even infinitesimal floating-point noise is enough to perturb the optimizer away. Saddle points are different because they have some attracting directions (the positive-curvature directions pull you toward the saddle) and some repelling directions. If you approach a saddle along an attracting direction, you can get “sucked in” and stall. The repelling directions eventually take over (especially with noise), but the stalling can be long. Local maxima have no attracting directions, so you never converge to them in practice. This is why the optimization discussion in deep learning focuses on saddle points and local minima, never local maxima.
Strong Answer:
  • This advice works surprisingly often for getting something to train, which is why it persists. But “something trains” and “trains optimally” are very different. There are several well-documented failure modes.
  • First, Adam can fail to generalize as well as SGD for certain architectures. On ImageNet with ResNets, SGD with momentum and a carefully tuned schedule consistently achieves 0.5-1% better top-1 accuracy than Adam. At production scale, that difference matters.
  • Second, lr=0.001 is wrong for fine-tuning. If you fine-tune a BERT or GPT model with lr=0.001, you will catastrophically overwrite the pretrained weights in the first few steps. Fine-tuning requires lr in the range 1e-5 to 5e-5, typically 20-100x smaller than training from scratch.
  • Third, Adam’s adaptive rates can cause unstable training with certain loss landscapes. In GANs, the adversarial dynamics create a non-stationary objective, and Adam’s second moment can lag behind rapid loss surface changes. The recommended Adam settings for GANs use beta1=0.5 instead of 0.9 and beta2=0.999, which is quite different from the default.
  • Fourth, for reinforcement learning, Adam with lr=0.001 often destabilizes training. The standard RL learning rate is 3e-4 (PPO default) or lower, and the nonstationarity of RL objectives means Adam’s moment estimates are frequently stale.
  • Fifth, vanilla Adam (not AdamW) applies weight decay incorrectly, as discussed earlier. Using Adam when you should use AdamW results in weaker regularization for parameters with large gradients, which hurts generalization.
  • My recommendation: Adam with lr=0.001 is a great starting point for prototyping. But for any model going to production, treat the optimizer and its hyperparameters as tunable. At minimum, tune the learning rate on a log scale (1e-5, 1e-4, 1e-3, 1e-2) and compare Adam versus AdamW versus SGD+momentum for your specific task.
Follow-up: You mentioned that Adam’s moment estimates can become stale in non-stationary settings. How does this manifest in practice, and what is the fix?

Staleness manifests as the optimizer “remembering” gradient statistics from a landscape that no longer exists. In GANs, the loss surface shifts every time the discriminator updates. Adam’s v estimate (beta2=0.999) averages over roughly 1000 past steps, but the loss surface from 1000 steps ago is completely different. This means the per-parameter learning rate scaling is based on outdated curvature information. You see symptoms like sudden loss spikes, mode collapse, or oscillating training dynamics. The fix is reducing beta2: setting it to 0.99 or even 0.9 shortens the memory window so the optimizer adapts faster to the changing landscape. For RL, a similar fix works. RAdam (Rectified Adam) addresses a related but different problem: it reduces the influence of the adaptive term in the first steps of training, when too few gradients have been seen for v to be reliable, and falls back to SGD-like behavior there. It does not detect staleness later in training, so shortening the memory window remains the main lever.