# Debugging Deep Learning

## Common Training Failures
Deep learning debugging is notoriously difficult: models fail silently or in confusing ways.

| Symptom | Possible Causes |
|---|---|
| Loss is NaN | Exploding gradients, bad learning rate, log(0) |
| Loss doesn’t decrease | LR too low, bug in loss, wrong labels |
| Loss decreases then plateaus | Needs LR decay, underfitting |
| Val loss increases (train decreases) | Overfitting |
| Accuracy stuck at random | Labels shuffled wrong, bug in model |
## Gradient Health Checks
### Monitor Gradient Norms
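A minimal sketch, assuming a PyTorch model: after `loss.backward()` and before `optimizer.step()`, log the L2 norm of each parameter's gradient and the total norm. The helper name `log_gradient_norms` is illustrative.

```python
import torch

def log_gradient_norms(model: torch.nn.Module) -> float:
    """Print per-parameter gradient L2 norms and return the total norm.

    Call after loss.backward() and before optimizer.step().
    """
    total_sq = 0.0
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        norm = param.grad.norm(2).item()
        total_sq += norm ** 2
        print(f"{name:40s} grad norm = {norm:.3e}")
    total = total_sq ** 0.5
    print(f"{'TOTAL':40s} grad norm = {total:.3e}")
    return total
```

A total norm that grows by orders of magnitude over a few hundred steps usually precedes a NaN loss; a norm that collapses toward zero points at vanishing gradients.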
### Visualize Gradient Flow
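One common way to visualize gradient flow, sketched here assuming PyTorch and matplotlib: plot the mean absolute gradient of each weight tensor, in layer order, on a log scale. The function name `plot_grad_flow` and the output path are illustrative.

```python
import matplotlib.pyplot as plt
import torch

def plot_grad_flow(model: torch.nn.Module, path: str = "grad_flow.png") -> None:
    """Bar chart of mean |grad| per weight tensor, in registration order."""
    names, means = [], []
    for name, param in model.named_parameters():
        if param.grad is not None and "bias" not in name:
            names.append(name)
            means.append(param.grad.abs().mean().item())
    plt.figure(figsize=(10, 4))
    plt.bar(range(len(means)), means)
    plt.xticks(range(len(names)), names, rotation=90, fontsize=6)
    plt.ylabel("mean |grad|")
    plt.yscale("log")
    plt.tight_layout()
    plt.savefig(path)
    plt.close()
```

Early layers whose bars sit orders of magnitude below the rest are the classic signature of vanishing gradients.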
### Detecting NaN/Inf
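Two complementary checks, sketched for PyTorch: `torch.autograd.set_detect_anomaly(True)` makes autograd raise at the backward op that produced a non-finite value (slow, so enable it only while debugging), and an explicit `torch.isfinite` check catches bad losses and gradients in the training loop. `check_finite` is an illustrative helper name.

```python
import torch

# Debug-only: autograd will raise at the backward op that produced NaN/Inf.
torch.autograd.set_detect_anomaly(True)

def check_finite(model: torch.nn.Module, loss: torch.Tensor) -> None:
    """Raise as soon as the loss or any gradient contains NaN/Inf."""
    if not torch.isfinite(loss).all():
        raise RuntimeError(f"Loss is not finite: {loss.item()}")
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            raise RuntimeError(f"Non-finite gradient in {name}")
```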
## Sanity Checks
### 1. Overfit a Single Batch
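A minimal sketch, assuming a PyTorch model and a standard loss: train on one fixed batch only. A healthy model should drive the loss to near zero; if it cannot, the bug is in the model, the loss, or the labels rather than in the amount of data or training time. The helper name and hyperparameters are illustrative.

```python
import torch

def overfit_single_batch(model, loss_fn, batch, steps: int = 500, lr: float = 1e-3) -> None:
    """Train repeatedly on one fixed batch; the loss should collapse toward zero."""
    inputs, targets = batch
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for step in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        if step % 50 == 0:
            print(f"step {step:4d}  loss {loss.item():.4f}")
```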
### 2. Check Data Pipeline
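A sketch of a quick pipeline check, assuming a PyTorch `DataLoader` for classification: pull one batch and print shapes, dtypes, value ranges, and the label distribution. `inspect_batch` is an illustrative helper name.

```python
import torch

def inspect_batch(loader) -> None:
    """Pull one batch and sanity-check shapes, dtypes, value ranges, and labels."""
    inputs, targets = next(iter(loader))
    print("input  shape/dtype:", tuple(inputs.shape), inputs.dtype)
    print("input  min/max/mean:",
          inputs.min().item(), inputs.max().item(), inputs.float().mean().item())
    print("target shape/dtype:", tuple(targets.shape), targets.dtype)
    if not targets.is_floating_point():
        print("label counts:", torch.bincount(targets.flatten()).tolist())
    # Common red flags: inputs not normalized, labels all one class,
    # or inputs and labels shuffled independently of each other.
```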
### 3. Verify Loss at Initialization
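For a classifier trained with cross-entropy, a randomly initialized model should predict roughly uniform probabilities, so the initial loss should be close to ln(num_classes) (about 2.30 for 10 classes). A sketch of that check, with illustrative names:

```python
import math
import torch

def check_init_loss(model, loss_fn, batch, num_classes: int) -> None:
    """Compare the loss at initialization against ln(num_classes)."""
    model.eval()
    with torch.no_grad():
        inputs, targets = batch
        loss = loss_fn(model(inputs), targets).item()
    expected = math.log(num_classes)
    print(f"initial loss = {loss:.3f}, expected ~ {expected:.3f}")
    if abs(loss - expected) > 1.0:
        print("WARNING: initial loss far from ln(num_classes); "
              "check the output layer and the loss function.")
```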
## Loss Landscape Visualization
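A simple sketch of one way to probe the loss landscape in PyTorch: evaluate the loss along a single random direction in weight space around the current weights and plot loss versus step size. More elaborate 2D, filter-normalized variants follow the same idea; the function name, span, and output path here are illustrative.

```python
import copy
import matplotlib.pyplot as plt
import torch

def loss_landscape_1d(model, loss_fn, batch, span: float = 1.0, points: int = 41) -> None:
    """Loss along one random direction in weight space around the current weights."""
    inputs, targets = batch
    base = copy.deepcopy(model.state_dict())
    # Perturb only floating-point tensors (skip e.g. BatchNorm's num_batches_tracked).
    direction = {k: torch.randn_like(v) for k, v in base.items() if v.is_floating_point()}
    alphas = torch.linspace(-span, span, points)
    losses = []
    model.eval()
    with torch.no_grad():
        for a in alphas:
            perturbed = {k: v + a.item() * direction[k] if k in direction else v
                         for k, v in base.items()}
            model.load_state_dict(perturbed)
            losses.append(loss_fn(model(inputs), targets).item())
    model.load_state_dict(base)  # restore the original weights
    plt.plot(alphas.tolist(), losses)
    plt.xlabel("step along a random direction")
    plt.ylabel("loss")
    plt.savefig("loss_landscape_1d.png")
    plt.close()
```

A sharp, jagged curve around the current weights often correlates with unstable training; normalization layers tend to smooth it out.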
## Common Fixes
| Problem | Fix |
|---|---|
| Exploding gradients | Gradient clipping (see the sketch after this table), lower LR, layer norm |
| Vanishing gradients | Residual connections, better initialization |
| Loss is NaN | Check for log(0), division by zero |
| Not learning | Verify data, check loss function, increase LR |
| Overfitting | Regularization, more data, smaller model |
| Underfitting | Larger model, more training, check data |
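For the gradient-clipping fix referenced in the table, a minimal sketch of where clipping sits in a PyTorch training step: between `loss.backward()` and `optimizer.step()`. The `max_norm=1.0` value and the helper name are illustrative.

```python
import torch

def training_step(model, loss_fn, optimizer, batch, max_norm: float = 1.0) -> float:
    """One training step with gradient clipping between backward() and step()."""
    inputs, targets = batch
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # Rescale gradients so their global L2 norm does not exceed max_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
    return loss.item()
```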
## Debugging Toolkit
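One reusable tool worth keeping in the kit, sketched for PyTorch: forward hooks that report the first module whose output contains NaN/Inf, which localizes numerical problems to a specific layer. `register_nan_hooks` is an illustrative name.

```python
import torch

def register_nan_hooks(model: torch.nn.Module):
    """Attach forward hooks that flag modules emitting NaN/Inf activations."""
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    print(f"Non-finite activation from {name} "
                          f"({module.__class__.__name__})")
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module itself
            handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # call handle.remove() on each when done
```

Combined with anomaly detection for the backward pass and the gradient-norm logging above, this answers most "where did the NaN come from" questions.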
## Exercises
### Exercise 1: Debug a Broken Model
Given a model that produces NaN loss, use debugging techniques to find and fix the issue.
### Exercise 2: Gradient Flow Analysis
Implement gradient flow visualization for a deep network. Identify vanishing gradients.
### Exercise 3: Loss Landscape
Generate loss landscape visualizations for networks with and without batch normalization.