Loss Functions & Objectives
The Central Role of Loss Functions
A neural network learns by minimizing a loss function (also called objective function, cost function, or criterion). The loss function answers: "How wrong is my prediction?" Training searches for the parameters that make the loss as small as possible:

$$\theta^* = \arg\min_\theta \, \mathcal{L}(\hat{y}, y), \qquad \hat{y} = f_\theta(x)$$

Where:
- $\theta$ = model parameters (weights and biases)
- $\hat{y}$ = model predictions
- $y$ = true labels
- $\mathcal{L}$ = loss function
Design Choice: Choosing the right loss function is a design decision that depends on:
- The type of problem (regression, classification, ranking)
- The output distribution you expect
- What kind of errors you care about most
Regression Loss Functions
Mean Squared Error (MSE)
| Property | Value |
|---|---|
| Range | [0, ∞) |
| Optimal | When y_pred = y_true |
| Gradient | Linear in error |
| Outlier sensitivity | HIGH (squared errors) |
Use MSE when:
- You assume roughly Gaussian noise on the targets
- You want to penalize large errors heavily
Mean Absolute Error (MAE / L1 Loss)
| Property | Value |
|---|---|
| Range | [0, ∞) |
| Gradient | Constant magnitude |
| Outlier sensitivity | LOW |
| Problem | Not differentiable at 0 |
Use MAE when:
- Outliers are expected in the data
- You care about the median prediction rather than the mean
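To make the outlier-sensitivity difference concrete, here is a tiny sketch with made-up numbers, where one target is an outlier:

```python
import torch

# Illustrative only: compare MSE and MAE on the same residuals,
# including one large outlier, to see the difference in sensitivity.
y_true = torch.tensor([1.0, 2.0, 3.0, 4.0, 100.0])   # last target is an outlier
y_pred = torch.tensor([1.1, 1.9, 3.2, 3.8, 4.0])

mse = torch.mean((y_pred - y_true) ** 2)
mae = torch.mean(torch.abs(y_pred - y_true))

print(f"MSE: {mse.item():.2f}")   # dominated by the single outlier (squared error ~9216)
print(f"MAE: {mae.item():.2f}")   # grows only linearly with the outlier
```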
Huber Loss (Smooth L1)
Combines the best of MSE and MAE: quadratic for small errors (like MSE), linear for large errors (like MAE):

$$L_\delta(y, \hat{y}) = \begin{cases} \tfrac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \le \delta \\ \delta\left(|y - \hat{y}| - \tfrac{1}{2}\delta\right) & \text{otherwise} \end{cases}$$

where $\delta$ controls the transition point between the quadratic and linear regimes.
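A from-scratch sketch of the piecewise definition above; PyTorch's `nn.HuberLoss` (and `nn.SmoothL1Loss`, which matches it when `beta = delta = 1`) provide the same behaviour:

```python
import torch

def huber_loss(y_pred, y_true, delta=1.0):
    """Sketch of the Huber loss: quadratic for |error| <= delta, linear beyond."""
    error = y_pred - y_true
    abs_error = torch.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_error - 0.5 * delta)
    return torch.where(abs_error <= delta, quadratic, linear).mean()
```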
Classification Loss Functions
Binary Cross-Entropy (BCE)
For binary classification with output $\hat{y} \in (0, 1)$:

$$\text{BCE} = -\frac{1}{n}\sum_{i=1}^{n}\left[\,y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\,\right]$$

- Confident and correct: Low loss
- Confident and wrong: HIGH loss
- Uncertain: Medium loss
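The sketch below (made-up probabilities) reproduces these three cases with `F.binary_cross_entropy`; in real training you would usually prefer `nn.BCEWithLogitsLoss` on raw logits, as discussed later.

```python
import torch
import torch.nn.functional as F

# Illustrative only: BCE for confident-correct, confident-wrong, and uncertain predictions.
y_true = torch.tensor([1.0, 1.0, 1.0])
y_pred = torch.tensor([0.95, 0.05, 0.5])   # probabilities, i.e. already passed through a sigmoid

bce_each = F.binary_cross_entropy(y_pred, y_true, reduction="none")
print(bce_each)  # ~[0.05, 3.00, 0.69]: confident-and-wrong is heavily penalized
```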
Categorical Cross-Entropy
For multi-class classification with $C$ classes:

$$\text{CE} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{c=1}^{C} y_{i,c}\,\log \hat{y}_{i,c}$$
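A one-sample sketch with made-up probabilities, just to show that only the true-class term contributes:

```python
import torch

# Illustrative only: cross-entropy for one sample with C = 3 classes.
probs = torch.tensor([0.7, 0.2, 0.1])          # softmax output
target_one_hot = torch.tensor([1.0, 0.0, 0.0])

ce = -(target_one_hot * torch.log(probs)).sum()
print(ce.item())  # ≈ 0.357, i.e. -log(0.7); the other class terms are zeroed out
```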
Cross-Entropy with Logits
In practice, we use the numerically stable version that combines softmax + cross-entropy in a single operation (via the log-sum-exp trick):

$$\mathcal{L}(z, y) = -z_y + \log \sum_{c=1}^{C} e^{z_c}$$

where $z$ are the raw logits and $y$ is the index of the true class.
Visualizing Loss Landscapes
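One simple way to get a feel for the landscape is to evaluate the loss along the straight line between two points in parameter space, for example the initial and the trained weights. A rough sketch follows; `model`, `loss_fn`, the two state dicts, and the `(X, y)` batch are placeholders you would supply, and it assumes the state dict contains only float tensors:

```python
import torch

def interpolate_loss(model, loss_fn, init_state, final_state, X, y, steps=25):
    """Evaluate the loss along the line (1 - alpha) * init + alpha * final."""
    losses = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        blended = {
            name: (1 - alpha) * init_state[name] + alpha * final_state[name]
            for name in init_state
        }
        model.load_state_dict(blended)
        with torch.no_grad():
            losses.append(loss_fn(model(X), y).item())
    return losses  # plot against alpha to see the path from initialization to the minimum
```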
Advanced Loss Functions
Focal Loss
Addresses class imbalance by down-weighting easy, well-classified examples:

$$\text{FL}(p_t) = -\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t)$$

where $p_t$ is the predicted probability of the true class, $\gamma \ge 0$ controls how aggressively easy examples are down-weighted, and $\alpha_t$ is an optional class-balancing weight.
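A sketch of the binary variant on raw logits; `binary_focal_loss` is our own helper (not a built-in), and the `alpha`/`gamma` defaults follow common practice:

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Sketch of binary focal loss on raw logits."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # = -log(p_t)
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```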
Label Smoothing
Prevents overconfidence by softening the one-hot targets:

$$y_c^{\text{LS}} = (1 - \epsilon)\,y_c + \frac{\epsilon}{C}$$

where $\epsilon$ is the smoothing factor (e.g. 0.1) and $C$ is the number of classes.
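In recent PyTorch versions this is a one-liner; the sketch below assumes $\epsilon = 0.1$, which is a common default:

```python
import torch.nn as nn

# Label smoothing built into CrossEntropyLoss; targets stay as ordinary class indices.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```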
Contrastive Loss
For learning embeddings where similar items are close and dissimilar items are at least a margin apart:

$$L = y\,d^2 + (1 - y)\,\max(0,\, m - d)^2, \qquad d = \lVert f(x_1) - f(x_2)\rVert_2$$

where $y = 1$ for similar pairs, $y = 0$ for dissimilar pairs, and $m$ is the margin.
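A sketch following the formula above; `contrastive_loss` here is our own helper rather than a core PyTorch criterion:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, label, margin=1.0):
    """Pairwise contrastive loss; label = 1 for similar pairs, 0 for dissimilar pairs."""
    d = F.pairwise_distance(emb1, emb2)                        # Euclidean distance per pair
    similar_term = label * d.pow(2)                            # pull similar pairs together
    dissimilar_term = (1 - label) * F.relu(margin - d).pow(2)  # push dissimilar pairs apart
    return (similar_term + dissimilar_term).mean()
```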
Triplet Loss
For learning embeddings with anchor-positive-negative triplets:

$$L = \max\left(0,\; d(a, p) - d(a, n) + m\right)$$

where $a$ is the anchor, $p$ a positive (same class), $n$ a negative (different class), $d$ a distance in embedding space (typically Euclidean), and $m$ the margin.
Loss Functions in PyTorch
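A non-exhaustive sketch of how the losses above map to standard `torch.nn` criteria:

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()                          # regression
mae = nn.L1Loss()                           # regression, robust to outliers
huber = nn.HuberLoss(delta=1.0)             # smooth transition between MSE and MAE
bce_logits = nn.BCEWithLogitsLoss()         # binary / multi-label, takes raw logits
ce = nn.CrossEntropyLoss()                  # multi-class, takes raw logits + class indices
triplet = nn.TripletMarginLoss(margin=1.0)  # anchor / positive / negative embeddings

# Example: cross-entropy expects logits of shape (N, C) and integer targets of shape (N,)
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])
print(ce(logits, targets))
```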
Custom Loss Functions
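Any differentiable function of tensors can serve as a loss. A minimal sketch of the usual `nn.Module` pattern, using a hypothetical per-sample-weighted MSE as the example:

```python
import torch
import torch.nn as nn

class WeightedMSELoss(nn.Module):
    """Hypothetical custom loss: MSE with per-sample weights."""
    def forward(self, y_pred, y_true, weights):
        return (weights * (y_pred - y_true) ** 2).mean()

criterion = WeightedMSELoss()
y_pred = torch.randn(8, requires_grad=True)   # stand-in for model output
loss = criterion(y_pred, torch.randn(8), torch.ones(8))
loss.backward()                               # autograd handles the gradients
```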
Choosing the Right Loss Function
Decision Guide
| Task | Loss Function | Output Activation |
|---|---|---|
| Regression | MSE | Linear |
| Regression with outliers | Huber / MAE | Linear |
| Binary classification | BCE | Sigmoid |
| Multi-class classification | Cross-Entropy | Softmax (or raw logits with CrossEntropyLoss) |
| Multi-label classification | BCE | Sigmoid |
| Imbalanced classification | Focal Loss | Softmax |
| Similarity learning | Contrastive / Triplet | Normalized embeddings |
| Object detection | Focal + Smooth L1 | Various |
| Image generation | Perceptual + L1 | Tanh |
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Using MSE for classification | Weak gradients when confidently wrong; poor probability calibration | Use cross-entropy |
| Forgetting epsilon in log | NaN/Inf values | Clip predictions |
| Wrong activation + loss combo | Numerical instability | Use the WithLogits versions (e.g. BCEWithLogitsLoss, CrossEntropyLoss on raw logits) |
| Ignoring class imbalance | Poor minority class performance | Focal loss, weighted loss |
Exercises
Exercise 1: Implement and Compare
Implement these losses from scratch (for labels $y \in \{-1, +1\}$ and raw score $f(x)$) and compare their gradients:
- Hinge loss: $\max(0,\, 1 - y \cdot f(x))$
- Exponential loss: $e^{-y \cdot f(x)}$
- Logistic loss: $\log\left(1 + e^{-y \cdot f(x)}\right)$
Exercise 2: Custom Multi-Task Loss
Create a loss function for a model that simultaneously:
- Classifies images (cross-entropy)
- Predicts bounding boxes (smooth L1)
- Estimates uncertainty (KL divergence)
Exercise 3: Loss Landscape Visualization
Train a small network on a 2D classification task. Visualize the loss landscape:
- Along the line connecting initial and final weights
- In a random 2D plane around the minimum
Exercise 4: Focal Loss Tuning
On an imbalanced dataset (1:10 ratio):
- Train with BCE, Weighted BCE, and Focal Loss
- Tune the gamma parameter in Focal Loss
- Plot precision-recall curves for each
Key Takeaways
| Concept | Key Insight |
|---|---|
| Loss = Objective | What we optimize is what we get |
| MSE vs MAE | MSE penalizes outliers more |
| Cross-Entropy | Information-theoretic, great for classification |
| Focal Loss | Handles class imbalance |
| Contrastive/Triplet | For learning embeddings |
| Label Smoothing | Prevents overconfidence |