Standard Gradient Descent is like walking downhill blindfolded. It works great on a smooth, simple hill. But real-world loss landscapes are treacherous.
Local Minima: Small dips that look like the bottom but aren’t.
Saddle Points: Regions where the gradient is nearly zero but you are not at a minimum, so progress stalls.
Ravines: Narrow valleys with steep walls, where updates bounce from side to side instead of moving along the floor.
Your Goal: Navigate this treacherous terrain to find the true global minimum, fast. You need better equipment than just “walking downhill”. You need Momentum and Adaptive Steps.
import numpy as np

# A quartic with a local minimum at x = -3 and a deeper global minimum near x = 3.3
def f(x):
    return 0.025*x**4 - 0.5*x**2 - 0.3*x

def grad(x):
    return 0.1*x**3 - x - 0.3

# 1. Standard SGD (gets stuck)
x = -5.0
lr = 0.1
for _ in range(100):
    x = x - lr * grad(x)
print(f"SGD stuck at x={x:.2f}")      # ≈ -3.0 (local minimum)

# 2. Momentum (escapes!)
x = -5.0
v = 0.0
beta = 0.9
for _ in range(100):
    v = beta * v + grad(x)            # heavy-ball form (same update PyTorch's SGD momentum uses)
    x = x - lr * v
print(f"Momentum reached x={x:.2f}")  # ≈ 3.3 (global minimum)
Key Insight: Momentum helps you blast through small traps and speed up on flat surfaces!
Picture a ravine running down to the sea: a gentle slope along the valley floor (low gradient in the x direction) and steep walls on either side (high gradient in the y direction).
If you take big steps, you bounce off the walls (y) and never move forward (x).
If you take small steps, you move forward (x) but it takes forever. Solution: wear Adaptive Shoes, taking big steps where gradients are small and tiny steps where they are steep (see the sketch below).
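Here is a minimal sketch of that idea: an RMSprop-style update that scales each coordinate's step by a running average of its squared gradients. The ravine function f(x, y) = 0.1x² + 10y², the starting point, the learning rate, and the step counts below are illustrative assumptions, not values from the lesson.

import numpy as np

# Ravine-shaped bowl: f(x, y) = 0.1*x**2 + 10*y**2 (gentle in x, steep in y)
def grad(x, y):
    return np.array([0.2 * x, 20.0 * y])

lr = 0.1

# Plain gradient descent: with this lr the y update is y -> y - lr*20*y = -y,
# so it bounces between the walls forever while x creeps along the valley floor.
p = np.array([-10.0, 1.0])
for _ in range(200):
    p -= lr * grad(*p)
print(f"Plain GD ended at      ({p[0]:.2f}, {p[1]:.2f})")   # x ≈ -0.18, y still bouncing at ±1

# "Adaptive shoes": divide each coordinate's step by the scale of its recent gradients
pos = np.array([-10.0, 1.0])
cache = np.zeros(2)          # running average of squared gradients, one entry per coordinate
decay, eps = 0.9, 1e-8
for _ in range(200):
    g = grad(*pos)
    cache = decay * cache + (1 - decay) * g**2
    pos -= lr * g / (np.sqrt(cache) + eps)   # big steps in flat x, small steps in steep y
print(f"Adaptive steps ended at ({pos[0]:.2f}, {pos[1]:.2f})")  # both coordinates hover near 0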
You rarely implement Adam from scratch. You use a library.
import torch
import torch.nn as nn
import torch.optim as optim

# 1. Define your model
model = nn.Linear(10, 1)

# 2. Choose your optimizer
# SGD
optimizer_sgd = optim.SGD(model.parameters(), lr=0.01)
# Momentum
optimizer_mom = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Adam (The Best)
optimizer_adam = optim.Adam(model.parameters(), lr=0.001)

# 3. Training loop (one step on a random batch, just to make it runnable)
X, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.MSELoss()(model(X), y)

optimizer_adam.zero_grad()   # clear old gradients
loss.backward()              # compute new gradients
optimizer_adam.step()        # this applies the math!
import numpy as np

# Rosenbrock function - a classic optimization challenge
# f(x, y) = (1-x)² + 100(y-x²)²
# Minimum at (1, 1)
# Starting point: (-2, 2)
# This is a "banana-shaped" valley - hard for basic GD!

# TODO:
# 1. Implement SGD, Momentum, and Adam
# 2. Run each for 5000 steps
# 3. Compare: distance to optimum, path length, convergence speed
💡 Solution
import numpy as np

def rosenbrock(x, y):
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(x, y):
    dx = -2*(1-x) - 400*x*(y - x**2)
    dy = 200*(y - x**2)
    return np.array([dx, dy])

def run_sgd(start, lr=0.001, steps=5000):
    pos = np.array(start, dtype=float)
    history = [pos.copy()]
    for _ in range(steps):
        grad = rosenbrock_grad(*pos)
        pos -= lr * grad
        history.append(pos.copy())
    return history

def run_momentum(start, lr=0.001, beta=0.9, steps=5000):
    pos = np.array(start, dtype=float)
    velocity = np.zeros(2)
    history = [pos.copy()]
    for _ in range(steps):
        grad = rosenbrock_grad(*pos)
        velocity = beta * velocity + grad
        pos -= lr * velocity
        history.append(pos.copy())
    return history

def run_adam(start, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    pos = np.array(start, dtype=float)
    m = np.zeros(2)
    v = np.zeros(2)
    history = [pos.copy()]
    for t in range(1, steps + 1):
        grad = rosenbrock_grad(*pos)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        pos -= lr * m_hat / (np.sqrt(v_hat) + eps)
        history.append(pos.copy())
    return history

print("🏁 Optimizer Shootout: Rosenbrock Function")
print("=" * 55)
print("Target: (1, 1) | Start: (-2, 2)")

start = [-2.0, 2.0]
optimal = np.array([1.0, 1.0])

histories = {
    "SGD": run_sgd(start),
    "Momentum": run_momentum(start),
    "Adam": run_adam(start)
}

print("\n📊 Results after 5000 steps:")
print("-" * 55)
print(f"{'Optimizer':<12} {'Final Position':<20} {'Distance':<10} {'Final Loss'}")
print("-" * 55)
for name, hist in histories.items():
    final = hist[-1]
    dist = np.linalg.norm(final - optimal)
    loss = rosenbrock(*final)
    print(f"{name:<12} ({final[0]:>7.4f}, {final[1]:>7.4f}) {dist:<10.6f} {loss:.6f}")

# Convergence speed analysis
print("\n⏱️ Steps to reach distance < 0.1 from optimum:")
for name, hist in histories.items():
    for i, pos in enumerate(hist):
        if np.linalg.norm(pos - optimal) < 0.1:
            print(f"  {name}: {i} steps")
            break
    else:
        print(f"  {name}: Never reached (final dist: {np.linalg.norm(hist[-1] - optimal):.4f})")

# Path analysis
print("\n📈 Path Length (total distance traveled):")
for name, hist in histories.items():
    path_length = sum(np.linalg.norm(np.array(hist[i+1]) - np.array(hist[i]))
                      for i in range(min(1000, len(hist)-1)))
    print(f"  {name}: {path_length:.2f}")

print("\n💡 Key Insights:")
print("  - SGD gets stuck in the curved valley")
print("  - Momentum helps but can overshoot")
print("  - Adam adapts step sizes and finds the optimum!")
Real-World Insight: The Rosenbrock function is a classic benchmark. Real neural network loss landscapes are even more complex - that’s why Adam is the go-to optimizer in most modern training code!
import numpy as np

def quadratic(x):
    return (x - 5)**2

def grad(x):
    return 2*(x - 5)

# TODO:
# 1. Implement constant LR (lr = 0.1)
# 2. Implement step decay (halve every 100 steps)
# 3. Implement exponential decay (lr = lr0 * 0.99^step)
# 4. Implement cosine annealing
# 5. Compare convergence and stability
💡 Solution
import numpy as np

def f(x):
    return (x - 5)**2

def grad(x):
    return 2*(x - 5)

def train_with_schedule(schedule_fn, steps=500, x0=0):
    x = x0
    history = [(x, f(x))]
    for step in range(steps):
        lr = schedule_fn(step)
        x = x - lr * grad(x)
        history.append((x, f(x)))
    return history

# Learning rate schedules
def constant(step, lr0=0.1):
    return lr0

def step_decay(step, lr0=0.5, decay_rate=0.5, decay_steps=100):
    return lr0 * (decay_rate ** (step // decay_steps))

def exponential(step, lr0=0.5, decay=0.99):
    return lr0 * (decay ** step)

def cosine_annealing(step, lr0=0.5, T=500):
    return lr0 * 0.5 * (1 + np.cos(np.pi * step / T))

def warmup_cosine(step, lr0=0.5, warmup_steps=50, T=500):
    if step < warmup_steps:
        return lr0 * step / warmup_steps
    return lr0 * 0.5 * (1 + np.cos(np.pi * (step - warmup_steps) / (T - warmup_steps)))

print("📅 Learning Rate Schedule Comparison")
print("=" * 55)

schedules = {
    "Constant (0.1)": lambda s: constant(s),
    "Step Decay": lambda s: step_decay(s),
    "Exponential": lambda s: exponential(s),
    "Cosine Annealing": lambda s: cosine_annealing(s),
    "Warmup + Cosine": lambda s: warmup_cosine(s),
}

print("\n📊 Training Results (500 steps, target x=5):")
print("-" * 55)
print(f"{'Schedule':<20} {'Final x':<12} {'Final Loss':<12} {'Converged at'}")
print("-" * 55)
for name, sched in schedules.items():
    hist = train_with_schedule(sched)
    final_x, final_loss = hist[-1]
    # Find when it first got close
    converge_step = None
    for i, (x, loss) in enumerate(hist):
        if abs(x - 5) < 0.01:
            converge_step = i
            break
    conv_str = f"step {converge_step}" if converge_step else "Never"
    print(f"{name:<20} {final_x:<12.6f} {final_loss:<12.8f} {conv_str}")

# Learning rate visualization
print("\n📈 Learning Rate Over Time:")
print(" Step | Constant | StepDecay | Exponent | Cosine ")
print(" -----|----------|-----------|----------|--------")
for step in [0, 50, 100, 200, 300, 400, 499]:
    c = constant(step)
    s = step_decay(step)
    e = exponential(step)
    cos = cosine_annealing(step)
    print(f" {step:4} | {c:8.4f} | {s:9.4f} | {e:8.4f} | {cos:8.4f}")

print("\n💡 Key Insights:")
print("  - Constant LR: Simple but may oscillate near optimum")
print("  - Step decay: Sudden drops can cause instability")
print("  - Exponential: Smooth decay, widely used")
print("  - Cosine: Modern favorite, smooth and goes to zero")
print("  - Warmup: Helps with unstable early gradients")
Real-World Insight: BERT, GPT, and most modern transformers use warmup + cosine (or linear) decay. The warmup phase is crucial for training stability with adaptive optimizers!
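If you want warmup + cosine in a real training loop, here is a hedged sketch using PyTorch's LambdaLR scheduler. The toy model, warmup length, total step count, and base learning rate are illustrative assumptions, not values from any particular transformer recipe.

import math
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(10, 1)                       # toy stand-in for a transformer
optimizer = optim.Adam(model.parameters(), lr=3e-4)

warmup_steps, total_steps = 100, 1000          # assumed schedule lengths

def lr_lambda(step):
    # Linear warmup from 0 to 1, then cosine decay from 1 down to 0
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    # ... forward pass and loss.backward() would go here in real training ...
    optimizer.step()       # placeholder update (no gradients in this sketch)
    scheduler.step()       # scales the base lr by lr_lambda at the current step
    if step in (0, warmup_steps, total_steps // 2, total_steps - 1):
        print(f"step {step:4d}  lr = {optimizer.param_groups[0]['lr']:.6f}")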
Explore the relationship between batch size and training:
import numpy as np

# Generate regression data
np.random.seed(42)
n = 10000
X = np.random.randn(n, 5)
true_w = np.array([1, -2, 0.5, 3, -1])
y = X @ true_w + np.random.randn(n) * 0.5

# TODO:
# 1. Train with batch sizes: 1, 16, 64, 256, full
# 2. For each: measure steps to convergence, final accuracy, variance in updates
# 3. Plot the gradient variance for different batch sizes
# 4. Find the "sweet spot" batch size for this problem
💡 Solution
import numpy as np

np.random.seed(42)
n = 10000
X = np.random.randn(n, 5)
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w + np.random.randn(n) * 0.5

def loss(w):
    return np.mean((X @ w - y) ** 2)

def gradient_full(w):
    return 2 * X.T @ (X @ w - y) / n

def gradient_batch(w, batch_idx):
    X_b = X[batch_idx]
    y_b = y[batch_idx]
    return 2 * X_b.T @ (X_b @ w - y_b) / len(batch_idx)

def train(batch_size, lr=0.01, max_epochs=50):
    w = np.zeros(5)
    losses = []
    grad_variances = []
    for epoch in range(max_epochs):
        if batch_size >= n:
            grad = gradient_full(w)
            w = w - lr * grad
            grad_variances.append(0)  # No variance in full batch
        else:
            indices = np.random.permutation(n)
            epoch_grads = []
            for i in range(0, n, batch_size):
                batch_idx = indices[i:i+batch_size]
                grad = gradient_batch(w, batch_idx)
                epoch_grads.append(grad)
                w = w - lr * grad
            grad_variances.append(np.var(epoch_grads))
        losses.append(loss(w))
        if losses[-1] < 0.3:  # Converged
            break
    return w, losses, grad_variances

print("⚖️ Batch Size Trade-offs")
print("=" * 55)

batch_sizes = [1, 16, 64, 256, 1024, n]
batch_names = ["1 (SGD)", "16", "64", "256", "1024", "Full"]

print("\n📊 Training Results (max 50 epochs):")
print("-" * 65)
print(f"{'Batch Size':<12} {'Epochs':<8} {'Final Loss':<12} {'Weight Error':<15} {'Avg Grad Var'}")
print("-" * 65)
for bs, name in zip(batch_sizes, batch_names):
    lr = 0.01 if bs < 256 else 0.1  # Larger LR for larger batches
    w, losses, gv = train(bs, lr=lr)
    epochs = len(losses)
    final_loss = losses[-1]
    w_error = np.linalg.norm(w - true_w)
    avg_var = np.mean(gv) if gv and gv[0] > 0 else 0
    print(f"{name:<12} {epochs:<8} {final_loss:<12.6f} {w_error:<15.6f} {avg_var:.6f}")

# Gradient variance analysis
print("\n📈 Gradient Variance vs Batch Size:")
print("  (Lower variance = more stable updates)")
for bs, name in zip([1, 16, 64, 256], ["1", "16", "64", "256"]):
    w = np.zeros(5)
    variances = []
    for _ in range(100):
        batch_idx = np.random.choice(n, bs, replace=False)
        grad = gradient_batch(w, batch_idx)
        variances.append(np.linalg.norm(grad))
    variance = np.var(variances)
    bar = "█" * int(variance * 5)
    print(f"  BS={name:>4}: {bar} {variance:.4f}")

print("\n💡 Key Insights:")
print("  - Small batches: High variance but can escape local minima")
print("  - Large batches: Low variance but may need higher LR")
print("  - Sweet spot (16-64): Balance of speed and stability")
print("  - GPUs favor powers of 2 (32, 64, 128, 256)")
Real-World Insight: Google’s “Large Batch Training” research showed you can use batch sizes of 32K+ with proper learning rate scaling. That kind of scaling is one of the ingredients that lets today’s largest models train in days instead of months!
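The usual heuristic behind that scaling is the linear scaling rule: if you multiply the batch size by k, multiply the learning rate by k as well (typically with a warmup phase). A tiny sketch, where the base learning rate and base batch size are assumed values for illustration:

# Linear scaling rule (heuristic): lr grows in proportion to the batch size
base_lr = 0.1        # assumed lr tuned at the base batch size
base_batch = 256     # assumed base batch size

for batch_size in [256, 1024, 8192, 32768]:
    scaled_lr = base_lr * (batch_size / base_batch)
    print(f"batch {batch_size:>6} -> lr {scaled_lr:g}")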
Implement the Adam optimizer and understand each component:
import numpy as np

# Implement Adam optimizer for a 2D problem
def f(x, y):
    return 0.5*x**2 + 5*y**2  # Elliptical bowl

def grad(x, y):
    return np.array([x, 10*y])

# TODO:
# 1. Implement Adam with m (momentum), v (RMSprop), and bias correction
# 2. Compare Adam to vanilla SGD on this problem
# 3. Visualize how m and v evolve during training
# 4. Show why Adam handles the different curvatures in x and y
💡 Solution
import numpy as np

def f(x, y):
    return 0.5*x**2 + 5*y**2

def grad(x, y):
    return np.array([x, 10*y])

def sgd(start, lr=0.1, steps=100):
    pos = np.array(start, dtype=float)
    history = [pos.copy()]
    for _ in range(steps):
        g = grad(*pos)
        pos -= lr * g
        history.append(pos.copy())
    return history

def adam(start, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    pos = np.array(start, dtype=float)
    m = np.zeros(2)  # First moment (momentum)
    v = np.zeros(2)  # Second moment (RMSprop)
    history = [pos.copy()]
    m_history = [m.copy()]
    v_history = [v.copy()]
    for t in range(1, steps + 1):
        g = grad(*pos)
        # Update moments
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        # Bias correction
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        # Update position
        pos -= lr * m_hat / (np.sqrt(v_hat) + eps)
        history.append(pos.copy())
        m_history.append(m.copy())
        v_history.append(v.copy())
    return history, m_history, v_history

print("🔧 Adam from Scratch")
print("=" * 55)
print("Function: f(x,y) = 0.5x² + 5y² (ellipse, steep in y)")
print("Start: (10, 1)")

start = [10.0, 1.0]
sgd_hist = sgd(start)
adam_hist, m_hist, v_hist = adam(start)

print("\n📊 Comparison (100 steps):")
print("-" * 55)
print(f"{'Step':<6} {'SGD Position':<25} {'Adam Position'}")
print("-" * 55)
for step in [0, 10, 25, 50, 100]:
    sgd_pos = sgd_hist[step]
    adam_pos = adam_hist[step]
    print(f"{step:<6} ({sgd_pos[0]:>8.4f}, {sgd_pos[1]:>8.4f}) ({adam_pos[0]:>8.4f}, {adam_pos[1]:>8.4f})")

print("\n🔬 Adam's Internal State Evolution:")
print(f"{'Step':<6} {'Gradient':<20} {'m (momentum)':<20} {'v (variance)'}")
print("-" * 70)
for step in [1, 5, 10, 25]:
    g = grad(*adam_hist[step-1])
    m = m_hist[step]
    v = v_hist[step]
    print(f"{step:<6} ({g[0]:>6.2f}, {g[1]:>6.2f}) ({m[0]:>6.4f}, {m[1]:>6.4f}) ({v[0]:>6.4f}, {v[1]:>6.4f})")

print("\n📈 Effective Learning Rates (Adam adapts per-dimension!):")
print(f"{'Step':<6} {'LR for x':<15} {'LR for y':<15}")
print("-" * 40)
for step in [1, 10, 50]:
    m = m_hist[step]
    v = v_hist[step]
    m_hat = m / (1 - 0.9**step)
    v_hat = v / (1 - 0.999**step)
    lr_x = 0.1 / (np.sqrt(v_hat[0]) + 1e-8)
    lr_y = 0.1 / (np.sqrt(v_hat[1]) + 1e-8)
    print(f"{step:<6} {lr_x:<15.6f} {lr_y:<15.6f}")

print("\n💡 Key Insights:")
print("  - Adam takes larger steps in x (flat direction)")
print("  - Adam takes smaller steps in y (steep direction)")
print("  - m provides momentum for consistent direction")
print("  - v provides per-dimension scaling")
print("  - Bias correction is crucial in early steps!")
Real-World Insight: Adam is the go-to first choice in PyTorch and TensorFlow workflows. Understanding its components helps you debug training issues - if the loss oscillates, try a lower learning rate (or a lower β1); if it barely decreases, try a higher learning rate!
1. Learning Rate ← Most important! Try [1e-4, 1e-3, 1e-2, 0.1]
2. Batch Size ← 16-64 for small data, 256-1024 for large
3. Momentum/Beta1 ← Usually 0.9 is fine
4. Beta2 ← Usually 0.999 is fine
5. Weight Decay ← Try [0, 1e-4, 1e-2] if overfitting
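To act on that priority list programmatically, here is a hedged sketch of random search over the learning rate. The train_and_evaluate function is a hypothetical placeholder for your own training run; the fake loss curve inside it exists only so the snippet executes end to end.

import numpy as np

def train_and_evaluate(lr):
    # Hypothetical placeholder: in real use, train briefly and return validation loss.
    # The fake curve below just prefers lr near 1e-3 so the sketch runs.
    return (np.log10(lr) + 3) ** 2 + 0.1 * np.random.rand()

np.random.seed(0)
best_lr, best_loss = None, float("inf")
for _ in range(20):
    lr = 10 ** np.random.uniform(-4, -1)   # sample log-uniformly over [1e-4, 1e-1]
    val_loss = train_and_evaluate(lr)
    if val_loss < best_loss:
        best_lr, best_loss = lr, val_loss
print(f"Best lr ≈ {best_lr:.5f}  (val loss {best_loss:.3f})")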
Answer: Adam combines momentum (a running average of gradients) with adaptive learning rates (each parameter's step is scaled by a running average of its squared gradients). This helps it:
Handle sparse gradients well
Converge faster with less tuning
Adapt to each parameter’s needs
However, SGD + momentum often generalizes better for computer vision - it’s just harder to tune!
What's the difference between Adam and AdamW?
Answer: AdamW decouples weight decay from the gradient update. In vanilla Adam, weight decay is applied to the gradient, which interacts poorly with the adaptive learning rate. AdamW applies weight decay directly to the weights:
# Adam (problematic)
w -= lr * (gradient + weight_decay * w) / sqrt(v)

# AdamW (correct)
w -= lr * gradient / sqrt(v)
w -= lr * weight_decay * w
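In PyTorch, switching is a one-line change; a minimal sketch (the lr and weight_decay values here are just illustrative):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)
# Decoupled weight decay: use optim.AdamW instead of optim.Adam
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)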
How do you debug a model that's not training?
Answer:
First, verify the model can overfit a tiny subset (10 examples). If not, there’s a bug.
Check for NaN/Inf in gradients and activations
Try different learning rates (10x higher and 10x lower)
Check data preprocessing (normalization, labels)
Verify loss function is correct for the task
Plot gradient norms over time (should be stable, not exploding/vanishing)
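For the NaN/Inf and gradient-norm checks above, a minimal PyTorch sketch; the linear model and random batch are stand-ins for your own model and data, and in real training you would log these values every few steps.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                        # stand-in for your model
X, y = torch.randn(32, 10), torch.randn(32, 1)  # stand-in for one batch

loss = nn.MSELoss()(model(X), y)
loss.backward()

total_sq = 0.0
for name, p in model.named_parameters():
    if p.grad is None:
        continue
    if not torch.isfinite(p.grad).all():
        print(f"Non-finite gradient in {name}!")
    total_sq += p.grad.norm() ** 2              # accumulate squared L2 norms
grad_norm = float(total_sq) ** 0.5
print(f"Gradient norm: {grad_norm:.4f}")        # watch this over time for explosions or collapse to 0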