Deep learning debugging is notoriously difficult. Unlike a segfault that points you to the offending line, a neural network fails silently. Your training loop runs without errors, the loss decreases smoothly, and three days later you discover the model has learned something completely useless. There are no compiler warnings for “your labels are accidentally shuffled” or “your learning rate is 1000x too high.”

An analogy: Debugging a neural network is like diagnosing a sick patient who cannot tell you their symptoms. You have to run tests (sanity checks), look at vital signs (gradient norms, loss curves), and use process of elimination. The best debuggers are not the ones who can read error messages; they are the ones who have a systematic checklist of things to verify before they ever start training.

    import torch
    import matplotlib.pyplot as plt

    def get_gradient_norms(model):
        """Get gradient norms per layer."""
        grad_norms = {}
        for name, param in model.named_parameters():
            if param.grad is not None:
                grad_norms[name] = param.grad.norm().item()
        return grad_norms

    # During training
    for epoch in range(epochs):
        for batch in dataloader:
            optimizer.zero_grad()
            loss = model(batch)
            loss.backward()

            # Check gradients before step
            grad_norms = get_gradient_norms(model)
            if any(norm > 100 for norm in grad_norms.values()):
                print("Warning: Large gradients detected!")

            # Clip gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()


    def check_for_nan(model, loss):
        """Check for NaN or Inf in the loss and the gradients."""
        if torch.isnan(loss) or torch.isinf(loss):
            print(f"Loss is {loss.item()}")
            return True
        for name, param in model.named_parameters():
            if param.grad is not None:
                if torch.isnan(param.grad).any():
                    print(f"NaN gradient in {name}")
                    return True
                if torch.isinf(param.grad).any():
                    print(f"Inf gradient in {name}")
                    return True
        return False

    # Use anomaly detection
    torch.autograd.set_detect_anomaly(True)  # Slow but catches issues
    try:
        loss.backward()
    except RuntimeError as e:
        print(f"Backward pass failed: {e}")

This is the single most important debugging technique in deep learning. Before training on the full dataset, verify that your model can memorize a single batch of data to near-perfect accuracy. If it cannot, something is fundamentally broken: a bug in the model architecture, the loss function, the data pipeline, or the optimizer. Do not waste hours training on the full dataset until this test passes.

Think of it as a smoke test: if the car will not start in the driveway, do not take it on the highway.

    def overfit_single_batch(model, dataloader, epochs=100,
                             criterion=torch.nn.functional.cross_entropy):
        """THE most important sanity check in deep learning.

        If the model cannot overfit one batch, something is fundamentally broken.
        This test catches: wrong loss function, broken forward pass,
        mismatched input/output dimensions, label encoding errors, and more.
        """
        batch = next(iter(dataloader))
        x, y = batch
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

        for epoch in range(epochs):
            optimizer.zero_grad()
            output = model(x)
            # criterion is passed in; it defaults to cross-entropy for classification
            loss = criterion(output, y)
            loss.backward()
            optimizer.step()

            if epoch % 10 == 0:
                acc = (output.argmax(1) == y).float().mean()
                print(f"Epoch {epoch}: Loss={loss.item():.4f}, Acc={acc.item():.4f}")

        # Should reach ~100% accuracy on this single batch
        with torch.no_grad():
            final_acc = (model(x).argmax(1) == y).float().mean()
        assert final_acc > 0.99, f"Failed to overfit! Acc={final_acc.item():.4f}"
        print("Model can overfit a single batch -- forward pass and loss are correct")

If this test fails, check in this order: (1) Are the labels correct? (print a few and verify visually), (2) Are input dimensions correct? (print shapes at each layer), (3) Is the loss function appropriate for your task? (e.g., using BCE for multi-class instead of cross-entropy), (4) Is the learning rate too low? (try 1e-2 or even 1e-1).
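For step (2), printing shapes at each layer by hand gets tedious for larger models. The following is a minimal sketch, assuming a standard `torch.nn.Module`, that uses forward hooks to record the output shape of every leaf module during one forward pass. The helper name `trace_output_shapes` and the toy model are hypothetical and only serve the illustration.

```python
import torch
import torch.nn as nn


def trace_output_shapes(model, x):
    """Run one forward pass and record the output shape of each leaf module."""
    shapes = []
    hooks = []

    def make_hook(name):
        def hook(module, inputs, output):
            # Only record tensor outputs; some modules return tuples.
            if isinstance(output, torch.Tensor):
                shapes.append((name, tuple(output.shape)))
        return hook

    for name, module in model.named_modules():
        # Leaf modules have no children; skip containers such as nn.Sequential.
        if len(list(module.children())) == 0:
            hooks.append(module.register_forward_hook(make_hook(name)))

    try:
        with torch.no_grad():
            model(x)
    finally:
        # Always remove the hooks so they do not fire during real training.
        for h in hooks:
            h.remove()
    return shapes


# Hypothetical toy model, for illustration only.
toy_model = nn.Sequential(
    nn.Flatten(), nn.Linear(28 * 28, 32), nn.ReLU(), nn.Linear(32, 10)
)
toy_shapes = trace_output_shapes(toy_model, torch.randn(4, 1, 28, 28))
for layer_name, shape in toy_shapes:
    print(f"{layer_name}: {shape}")
```

If the last recorded shape does not end in the number of classes you expect, the mismatch is usually visible immediately in this listing.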
A randomly initialized classifier should assign roughly equal probability to all classes. For cross-entropy loss with K classes, this means the initial loss should be approximately −log(1/K)=log(K). For CIFAR-10 (10 classes), expect ~2.30. For ImageNet (1000 classes), expect ~6.91. If your initial loss is significantly different, something is wrong with the model or the loss function.

    import torch.nn.functional as F

    def check_initial_loss(model, dataloader, num_classes):
        """Loss should be ~log(num_classes) for random weights.

        This catches: wrong number of output classes, broken softmax,
        incorrect loss function, biased initialization.
        """
        model.eval()
        batch = next(iter(dataloader))
        x, y = batch

        with torch.no_grad():
            output = model(x)
            loss = F.cross_entropy(output, y)

        expected = -torch.log(torch.tensor(1.0 / num_classes))
        print(f"Initial loss: {loss.item():.4f}")
        print(f"Expected (random): {expected.item():.4f}")

        if abs(loss.item() - expected.item()) > 0.5:
            print("WARNING: Initial loss is unexpected - check model initialization")
            print("  If loss is much HIGHER: output layer may have wrong dimensions")
            print("  If loss is much LOWER: model may have a bias toward certain classes")

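Why this matters: If the initial loss is 0.1 when it should be 2.3, your model is already “confident” (and mostly right) before seeing any data. This usually means the final layer bias is accidentally initialized to favor classes that happen to dominate the batch; if the classes are balanced, suspect labels leaking into the inputs or weights that are not actually random. If the initial loss is 15.0 when it should be 6.9, the model is confidently wrong: the logits are likely far too large in magnitude, or the loss function is wrong.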
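To make the effect of a skewed output layer concrete, here is a small sketch that feeds hand-built logits into cross-entropy. The logit values, the bias of 8.0 on class 3, and the helper `mean_class_probabilities` are arbitrary choices for illustration, not values from a real model. It suggests that the direction of the deviation depends on the labels: a bias toward one class lowers the loss only when that class dominates the batch, and raises it when labels are balanced.

```python
import math

import torch
import torch.nn.functional as F


def mean_class_probabilities(logits):
    """Average softmax probability per class over a batch of logits."""
    return F.softmax(logits, dim=1).mean(dim=0)


num_classes = 10
batch_size = 64

# All-zero logits mimic a well-behaved fresh initialization (uniform predictions).
uniform_logits = torch.zeros(batch_size, num_classes)

# A large offset on class 3 mimics a skewed final-layer bias (hypothetical case).
biased_logits = uniform_logits.clone()
biased_logits[:, 3] += 8.0

balanced_targets = torch.arange(batch_size) % num_classes
skewed_targets = torch.full((batch_size,), 3, dtype=torch.long)

uniform_loss = F.cross_entropy(uniform_logits, balanced_targets).item()
biased_balanced_loss = F.cross_entropy(biased_logits, balanced_targets).item()
biased_skewed_loss = F.cross_entropy(biased_logits, skewed_targets).item()

print(f"log(K):                      {math.log(num_classes):.4f}")
print(f"uniform logits:              {uniform_loss:.4f}")
print(f"biased logits, balanced y:   {biased_balanced_loss:.4f}")
print(f"biased logits, all y == 3:   {biased_skewed_loss:.4f}")
print("mean predicted probabilities:", mean_class_probabilities(biased_logits))
```

Printing the mean predicted probabilities alongside the loss is a cheap way to see which class, if any, the untrained model favors.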

    import torch.nn.functional as F

    def plot_loss_landscape(model, dataloader, resolution=20):
        """Visualize 2D loss landscape around current parameters."""
        # Get two random directions
        direction1 = [torch.randn_like(p) for p in model.parameters()]
        direction2 = [torch.randn_like(p) for p in model.parameters()]

        # Normalize directions
        d1_norm = torch.sqrt(sum((d ** 2).sum() for d in direction1))
        d2_norm = torch.sqrt(sum((d ** 2).sum() for d in direction2))
        direction1 = [d / d1_norm for d in direction1]
        direction2 = [d / d2_norm for d in direction2]

        # Save original parameters
        original_params = [p.clone() for p in model.parameters()]

        losses = torch.zeros(resolution, resolution)
        alphas = torch.linspace(-1, 1, resolution)
        betas = torch.linspace(-1, 1, resolution)

        batch = next(iter(dataloader))
        x, y = batch

        for i, alpha in enumerate(alphas):
            for j, beta in enumerate(betas):
                # Perturb parameters
                for p, orig, d1, d2 in zip(model.parameters(), original_params,
                                           direction1, direction2):
                    p.data = orig + alpha * d1 + beta * d2

                with torch.no_grad():
                    loss = F.cross_entropy(model(x), y)
                losses[i, j] = loss.item()

        # Restore original parameters
        for p, orig in zip(model.parameters(), original_params):
            p.data = orig

        # Plot. contourf expects Z indexed as [y, x], so transpose losses[alpha, beta].
        plt.figure(figsize=(8, 6))
        plt.contourf(alphas.numpy(), betas.numpy(), losses.numpy().T, levels=50)
        plt.colorbar(label='Loss')
        plt.xlabel('Direction 1')
        plt.ylabel('Direction 2')
        plt.title('Loss Landscape')
        plt.savefig('loss_landscape.png')

Are labels correct? Try overfitting a single batch.
Overfitting | Train loss decreasing, val loss increasing | Regularization, more data, smaller model | Add dropout, data augmentation, weight decay.
Underfitting | Both train and val loss high | Larger model, more training, check data quality | Is the model too small? Is the data noisy or mislabeled?
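The overfitting and underfitting symptoms above can be turned into a rough automated check on recorded loss curves. This is only a heuristic sketch: the function name `diagnose`, the `high_loss` threshold, and the `window` size are assumptions chosen for illustration, and a sensible threshold depends on your task (for classification, something near log(K) may be a reasonable starting point).

```python
def diagnose(train_losses, val_losses, high_loss=1.0, window=3):
    """Return a rough label for a pair of per-epoch loss curves.

    `high_loss` and `window` are arbitrary illustrative thresholds.
    """
    if len(train_losses) < 2 * window or len(val_losses) < 2 * window:
        return "not enough history"

    def mean(values):
        return sum(values) / len(values)

    # Trend = average of the last `window` epochs minus the window before it.
    train_trend = mean(train_losses[-window:]) - mean(train_losses[-2 * window:-window])
    val_trend = mean(val_losses[-window:]) - mean(val_losses[-2 * window:-window])

    # Overfitting: train loss still falling while val loss rises.
    if train_trend < 0 and val_trend > 0:
        return "overfitting"
    # Underfitting: both recent losses remain high.
    if mean(train_losses[-window:]) > high_loss and mean(val_losses[-window:]) > high_loss:
        return "underfitting"
    return "no obvious problem"


overfit_label = diagnose([1.0, 0.8, 0.6, 0.4, 0.3, 0.2], [1.0, 0.9, 0.8, 0.9, 1.0, 1.1])
underfit_label = diagnose([2.5, 2.4, 2.4, 2.3, 2.3, 2.3], [2.6, 2.5, 2.5, 2.4, 2.4, 2.4])
healthy_label = diagnose([1.0, 0.8, 0.6, 0.4, 0.3, 0.2], [1.1, 0.9, 0.7, 0.5, 0.4, 0.3])
print(overfit_label, underfit_label, healthy_label)
```

A check like this is best treated as a prompt to look at the curves, not as a replacement for plotting them.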
The number one debugging rule: Change one thing at a time. If you simultaneously increase the learning rate, add dropout, and change the architecture, you will never know which change had what effect. Disciplined, isolated experiments save more time than they cost.
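One lightweight way to enforce this rule is to diff the configuration of a new run against the baseline before launching it. The sketch below assumes hyperparameters are kept in plain dictionaries; the helper `changed_keys` and the example settings are hypothetical.

```python
def changed_keys(baseline, candidate):
    """Return the sorted names of settings that differ between two run configs."""
    missing = object()  # sentinel so an absent key differs from any real value
    all_keys = set(baseline) | set(candidate)
    return sorted(
        key for key in all_keys
        if baseline.get(key, missing) != candidate.get(key, missing)
    )


# Hypothetical experiment configs, for illustration only.
baseline_config = {"lr": 1e-3, "dropout": 0.0, "hidden_size": 256}
isolated_change = dict(baseline_config, lr=3e-3)
tangled_change = dict(baseline_config, lr=3e-3, dropout=0.5)

for run_name, config in [("isolated", isolated_change), ("tangled", tangled_change)]:
    diff = changed_keys(baseline_config, config)
    status = "ok" if len(diff) <= 1 else "changes more than one thing"
    print(f"{run_name}: changed {diff} -> {status}")
```

Logging the diff next to each run's results also makes it easier to attribute an improvement or regression to a specific change later.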