Loss Functions & Objectives

The Central Role of Loss Functions

A neural network learns by minimizing a loss function (also called objective function, cost function, or criterion). The loss function answers: “How wrong is my prediction?”

Think of the loss function as the coach’s scoring rubric. Two different coaches might evaluate the same performance differently: one penalizes big mistakes harshly (MSE), another penalizes every mistake in proportion to its size (MAE). The rubric you choose shapes what the athlete (model) optimizes for. Choose the wrong rubric and you get a model that is technically “optimizing” but optimizing the wrong thing. This is why loss function selection is one of the most consequential design decisions in deep learning.

$$\text{Training} = \underset{\theta}{\text{minimize}} \; \mathcal{L}(f_\theta(X), Y)$$

Where:
  • $\theta$ = model parameters (weights and biases)
  • $f_\theta(X)$ = model predictions
  • $Y$ = true labels
  • $\mathcal{L}$ = loss function
Design Choice: The right loss function depends on:
  1. The type of problem (regression, classification, ranking)
  2. The output distribution you expect
  3. What kind of errors you care about most
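To make the "training = minimize the loss over θ" statement concrete, here is a minimal sketch that fits a one-feature linear model by gradient descent on MSE. The toy data, the learning rate, and the step count are illustrative assumptions, not values from this guide.

```python
import numpy as np

# Toy data generated from y = 2x + 1 (illustrative values only)
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = 2.0 * X + 1.0

w, b = 0.0, 0.0          # theta = (w, b), initialised at zero
learning_rate = 0.1      # assumed step size, small enough to be stable here
losses = []

for step in range(500):
    y_pred = w * X + b                   # f_theta(X)
    error = y_pred - Y
    losses.append(np.mean(error ** 2))   # L(f_theta(X), Y), here MSE

    # Gradients of the MSE with respect to w and b
    grad_w = 2.0 * np.mean(error * X)
    grad_b = 2.0 * np.mean(error)

    # Move theta a small step against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w = {w:.4f}, b = {b:.4f}, final loss = {losses[-1]:.2e}")
```

The loop recovers parameters close to the generating values (w near 2, b near 1). Swapping the loss changes which parameters count as "best", which is the point of the sections that follow.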

Regression Loss Functions

Mean Squared Error (MSE)

$$\mathcal{L}_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
import numpy as np
import matplotlib.pyplot as plt

def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)

def mse_gradient(y_true, y_pred):
    """Gradient w.r.t. y_pred"""
    return 2 * (y_pred - y_true) / len(y_true)

# Example
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 2.2, 2.8, 4.1])

print(f"MSE Loss: {mse_loss(y_true, y_pred):.4f}")
Properties:

| Property | Value |
| --- | --- |
| Range | [0, ∞) |
| Optimal | When y_pred = y_true |
| Gradient | Linear in error |
| Outlier sensitivity | HIGH (squared errors) |

When to use:
  • Regression with Gaussian noise assumption (MSE is the maximum likelihood estimator when errors are normally distributed)
  • When you want to penalize large errors heavily — a prediction off by 10 is penalized 100x more than one off by 1
Mathematical intuition: Minimizing MSE is equivalent to maximizing the likelihood of the data under a Gaussian noise model. This is not just a convenient formula — there is a deep statistical justification. If your errors truly follow a bell curve, MSE is the optimal loss.
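The equivalence can be checked numerically. The sketch below assumes a fixed noise scale `sigma` (a hypothetical parameter, not something the guide defines) and shows that the Gaussian negative log-likelihood is just a rescaled MSE plus a constant, so both objectives rank predictions identically.

```python
import numpy as np

def gaussian_nll(y_true, y_pred, sigma=1.0):
    """Negative log-likelihood of y_true under N(y_pred, sigma^2), summed over samples."""
    n = len(y_true)
    squared_error = np.sum((y_true - y_pred) ** 2)
    return 0.5 * n * np.log(2 * np.pi * sigma ** 2) + squared_error / (2 * sigma ** 2)

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
pred_a = np.array([1.1, 2.2, 2.8, 4.1])   # close predictions
pred_b = np.array([1.5, 2.5, 2.0, 4.5])   # worse predictions

n, sigma = len(y_true), 1.0

# NLL = n / (2 sigma^2) * MSE + constant, so differences in NLL
# are exactly differences in MSE scaled by n / (2 sigma^2).
nll_gap = gaussian_nll(y_true, pred_a, sigma) - gaussian_nll(y_true, pred_b, sigma)
mse_gap = mse(y_true, pred_a) - mse(y_true, pred_b)

print(np.isclose(nll_gap, n / (2 * sigma ** 2) * mse_gap))
```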

Mean Absolute Error (MAE / L1 Loss)

$$\mathcal{L}_{\text{MAE}} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$
def mae_loss(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mae_gradient(y_true, y_pred):
    return np.sign(y_pred - y_true) / len(y_true)

print(f"MAE Loss: {mae_loss(y_true, y_pred):.4f}")
Properties:

| Property | Value |
| --- | --- |
| Range | [0, ∞) |
| Gradient | Constant magnitude |
| Outlier sensitivity | LOW |
| Problem | Not differentiable at 0 |

When to use:
  • When outliers are expected
  • When you care about median prediction
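The link between MAE and the median (and between MSE and the mean) shows up in a small experiment: search for the single constant prediction that minimizes each loss. The data and the candidate grid below are illustrative assumptions.

```python
import numpy as np

# Targets with one outlier (illustrative values)
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Try many constant predictions c and score each one under both losses
candidates = np.linspace(0.0, 100.0, 10001)       # grid with step 0.01
errors = y[None, :] - candidates[:, None]         # shape (10001, 5)
best_mse = candidates[np.argmin(np.mean(errors ** 2, axis=1))]
best_mae = candidates[np.argmin(np.mean(np.abs(errors), axis=1))]

print(f"MSE-optimal constant: {best_mse:.2f} (mean   = {np.mean(y):.2f})")
print(f"MAE-optimal constant: {best_mae:.2f} (median = {np.median(y):.2f})")
```

The outlier drags the MSE-optimal constant to the mean (22), while the MAE-optimal constant stays at the median (3).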

Huber Loss (Smooth L1)

Combines the best of MSE and MAE:

$$\mathcal{L}_{\text{Huber}} = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\ \delta|y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$
def huber_loss(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    is_small = np.abs(error) <= delta
    squared_loss = 0.5 * error**2
    linear_loss = delta * np.abs(error) - 0.5 * delta**2
    return np.mean(np.where(is_small, squared_loss, linear_loss))
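The MSE and MAE sections include gradients, so here is a matching sketch for Huber. The name `huber_gradient` is hypothetical. It follows from differentiating the two branches of the formula: the error itself in the quadratic zone, and ±δ in the linear zone.

```python
import numpy as np

def huber_gradient(y_true, y_pred, delta=1.0):
    """Gradient of the mean Huber loss w.r.t. y_pred."""
    error = y_pred - y_true
    # Quadratic zone: gradient equals the error (MSE-like, without the factor 2).
    # Linear zone: gradient is capped at +/- delta (MAE-like, scaled by delta).
    return np.clip(error, -delta, delta) / len(y_true)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 2.2, 2.8, 14.0])   # last prediction is far off

grad = huber_gradient(y_true, y_pred)
print(grad)
```

The outlier's error is 10, but its gradient contribution is capped at δ/n = 0.25, which is why one bad point cannot dominate the update.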
[Figure: Regression Loss Functions]

Classification Loss Functions

Binary Cross-Entropy (BCE)

For binary classification with output $\hat{y} \in (0, 1)$:

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]$$
def binary_cross_entropy(y_true, y_pred, epsilon=1e-15):
    """
    Binary cross-entropy loss.
    
    Args:
        y_true: True labels (0 or 1)
        y_pred: Predicted probabilities (0 to 1)
    """
    # Clip to avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    
    loss = -np.mean(
        y_true * np.log(y_pred) + 
        (1 - y_true) * np.log(1 - y_pred)
    )
    return loss

def bce_gradient(y_true, y_pred, epsilon=1e-15):
    """Gradient w.r.t. y_pred"""
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return (y_pred - y_true) / (y_pred * (1 - y_pred)) / len(y_true)

# Example
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0.1, 0.2, 0.8, 0.9])

print(f"BCE Loss: {binary_cross_entropy(y_true, y_pred):.4f}")
Intuition: BCE measures the “surprise” the loss function experiences when it sees the true label given your prediction. If you predicted 0.99 for a positive example, there is almost no surprise — low loss. If you predicted 0.01 for a positive example, the loss function is shocked — the loss explodes toward infinity.
  • Confident and correct: Low loss (you predicted what happened)
  • Confident and wrong: Very high loss (the log penalty is brutal — this is by design)
  • Uncertain (0.5): Medium loss (you are admitting you do not know)
This asymmetric penalty structure is what makes cross-entropy so effective for classification: it ruthlessly punishes confident wrong answers, which forces the model to be calibrated rather than just accurate.
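The three cases above can be put in numbers. This sketch evaluates BCE for a single positive example at three predicted probabilities; `bce_single` is a hypothetical helper.

```python
import numpy as np

def bce_single(y, p):
    """BCE for one example with label y in {0, 1} and predicted probability p."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# All three cases use a positive example (y = 1)
confident_correct = bce_single(1, 0.99)
uncertain = bce_single(1, 0.50)
confident_wrong = bce_single(1, 0.01)

print(f"p = 0.99: {confident_correct:.4f}")   # 0.0101
print(f"p = 0.50: {uncertain:.4f}")           # 0.6931
print(f"p = 0.01: {confident_wrong:.4f}")     # 4.6052
```

The confident wrong prediction costs roughly 450 times more than the confident correct one.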

Categorical Cross-Entropy

For multi-class classification with $K$ classes:

$$\mathcal{L}_{\text{CE}} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}y_{ik}\log(\hat{y}_{ik})$$
def categorical_cross_entropy(y_true, y_pred, epsilon=1e-15):
    """
    Categorical cross-entropy loss.
    
    Args:
        y_true: One-hot encoded labels (n, K)
        y_pred: Predicted probabilities (n, K)
    """
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

def softmax(z):
    """Stable softmax"""
    exp_z = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Example: 3 samples, 4 classes
y_true = np.array([
    [1, 0, 0, 0],  # Class 0
    [0, 1, 0, 0],  # Class 1
    [0, 0, 0, 1],  # Class 3
])

logits = np.array([
    [2.0, 0.5, 0.1, 0.1],
    [0.1, 1.5, 0.3, 0.2],
    [0.2, 0.1, 0.3, 1.8],
])

y_pred = softmax(logits)
print(f"Predictions:\n{y_pred}")
print(f"CE Loss: {categorical_cross_entropy(y_true, y_pred):.4f}")

Cross-Entropy with Logits

In practice, we use the numerically stable version that combines softmax + cross-entropy. This is not just a convenience — it is a necessity.
Critical pitfall: Never compute softmax and then take its log as two separate steps. When a softmax output underflows to 0, its log is negative infinity. The fused log-softmax (or CrossEntropyLoss in PyTorch, which combines log-softmax and negative log-likelihood) uses the log-sum-exp trick to avoid this. In production, use nn.CrossEntropyLoss()(logits, targets) rather than nn.NLLLoss()(torch.log(F.softmax(logits, dim=1)), targets): the two are mathematically identical, but the former is numerically safer.
def cross_entropy_with_logits(y_true, logits):
    """
    Numerically stable cross-entropy from logits.
    Uses log-sum-exp trick.
    """
    # Shift by the row-wise max so np.exp never overflows
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))

    # y_true can be one-hot or class indices
    if y_true.ndim == 1:  # Class indices
        n = len(y_true)
        return -np.mean(log_softmax[np.arange(n), y_true])
    else:  # One-hot
        return -np.mean(np.sum(y_true * log_softmax, axis=1))
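To see why the fused form matters, the sketch below compares the two routes on a deliberately extreme pair of logits. The values are chosen only to force underflow. Real logits are rarely this large, but the same failure appears whenever a probability rounds to exactly zero.

```python
import numpy as np

logits = np.array([[0.0, 1000.0]])   # extreme logits, chosen to force underflow
target = np.array([0])               # true class is index 0

shifted = logits - np.max(logits, axis=1, keepdims=True)

# Naive route: softmax first, log second
probs = np.exp(shifted) / np.sum(np.exp(shifted), axis=1, keepdims=True)
with np.errstate(divide="ignore"):   # silence the log(0) warning for the demo
    naive_loss = -np.log(probs[0, target[0]])

# Fused route: stay in log space the whole time
log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
stable_loss = -log_probs[0, target[0]]

print(f"naive:  {naive_loss}")    # inf, because exp(-1000) underflows to 0.0
print(f"stable: {stable_loss}")   # 1000.0
```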

Visualizing Loss Landscapes

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def plot_loss_landscape():
    """Visualize loss as a function of two parameters."""
    # Simple quadratic loss landscape
    w1 = np.linspace(-3, 3, 100)
    w2 = np.linspace(-3, 3, 100)
    W1, W2 = np.meshgrid(w1, w2)
    
    # Loss function: (w1 - 1)^2 + 2*(w2 - 0.5)^2
    L = (W1 - 1)**2 + 2*(W2 - 0.5)**2
    
    fig = plt.figure(figsize=(14, 5))
    
    # 3D surface
    ax1 = fig.add_subplot(121, projection='3d')
    ax1.plot_surface(W1, W2, L, cmap='viridis', alpha=0.8)
    ax1.set_xlabel('w₁')
    ax1.set_ylabel('w₂')
    ax1.set_zlabel('Loss')
    ax1.set_title('Loss Landscape (3D)')
    
    # Contour plot
    ax2 = fig.add_subplot(122)
    contour = ax2.contour(W1, W2, L, levels=20, cmap='viridis')
    ax2.clabel(contour, inline=True, fontsize=8)
    ax2.plot(1, 0.5, 'r*', markersize=15, label='Minimum')
    ax2.set_xlabel('w₁')
    ax2.set_ylabel('w₂')
    ax2.set_title('Loss Landscape (Contour)')
    ax2.legend()
    
    plt.tight_layout()
    plt.show()

plot_loss_landscape()
[Figure: Loss Landscape Visualization]

Advanced Loss Functions

Focal Loss

Addresses class imbalance by down-weighting easy examples:

$$\mathcal{L}_{\text{focal}} = -\alpha_t (1 - p_t)^\gamma \log(p_t)$$
def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25):
    """
    Focal loss for handling class imbalance.
    
    Core idea: down-weight easy examples so the model focuses on hard ones.
    Standard CE gives equal weight to a 99%-confident correct prediction
    and a 51%-confident correct prediction. Focal loss says "the 99% one
    is already solved -- spend your gradient budget on the 51% one."
    
    Args:
        gamma: Focusing parameter (higher = more focus on hard examples)
               gamma=0 is standard CE; gamma=2 is the most common choice
        alpha: Class weight (balances positive vs negative classes)
    """
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
    
    # Compute pt (probability of true class)
    pt = np.where(y_true == 1, y_pred, 1 - y_pred)
    
    # Focal weight
    focal_weight = (1 - pt) ** gamma
    
    # Alpha weight
    alpha_weight = np.where(y_true == 1, alpha, 1 - alpha)
    
    # Cross-entropy
    ce = -np.log(pt)
    
    return np.mean(alpha_weight * focal_weight * ce)

# Compare BCE vs Focal Loss on imbalanced data
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])  # 10% positive class
y_pred = np.array([0.1, 0.2, 0.1, 0.15, 0.05, 0.1, 0.2, 0.1, 0.15, 0.7])

print(f"BCE Loss: {binary_cross_entropy(y_true, y_pred):.4f}")
print(f"Focal Loss: {focal_loss(y_true, y_pred):.4f}")

Label Smoothing

Prevents overconfidence by softening targets:

$$y'_k = \begin{cases} 1 - \epsilon + \frac{\epsilon}{K} & \text{if } k = \text{true class} \\ \frac{\epsilon}{K} & \text{otherwise} \end{cases}$$
def label_smoothing(y_true, n_classes, epsilon=0.1):
    """
    Apply label smoothing to one-hot labels.
    
    Instead of [1, 0, 0, 0], use [0.925, 0.025, 0.025, 0.025].
    This prevents the model from becoming infinitely confident,
    which improves generalization and calibration.
    """
    return y_true * (1 - epsilon) + epsilon / n_classes

# Example
y_hard = np.array([1, 0, 0, 0])  # Hard label: "I am 100% certain it is class 0"
y_smooth = label_smoothing(y_hard, n_classes=4, epsilon=0.1)
print(f"Hard label: {y_hard}")
print(f"Smooth label: {y_smooth}")
# Output: [0.925, 0.025, 0.025, 0.025] -- "I am 92.5% certain, with some humility"
Practical tip: Label smoothing with epsilon=0.1 is nearly free to implement and often improves accuracy by a few tenths of a percent (roughly 0.2-0.5%) on classification tasks. It is especially valuable when your training labels might contain noise, which real-world datasets almost always do. Think of it as teaching the model epistemic humility.

Contrastive Loss

For learning embeddings where similar items are close:

$$\mathcal{L}_{\text{contrastive}} = (1-y) \cdot \frac{1}{2}d^2 + y \cdot \frac{1}{2}\max(0, m - d)^2$$
def contrastive_loss(embeddings1, embeddings2, labels, margin=1.0):
    """
    Contrastive loss for similarity learning.
    
    Args:
        embeddings1, embeddings2: Pair of embeddings
        labels: 0 if similar, 1 if dissimilar
        margin: Margin for dissimilar pairs
    """
    distances = np.linalg.norm(embeddings1 - embeddings2, axis=1)
    
    # Similar pairs: minimize distance
    similar_loss = (1 - labels) * 0.5 * distances**2
    
    # Dissimilar pairs: push apart beyond margin
    dissimilar_loss = labels * 0.5 * np.maximum(0, margin - distances)**2
    
    return np.mean(similar_loss + dissimilar_loss)
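A short usage sketch helps show what the margin does. It repeats the same formula but returns per-pair losses so each case is visible. The function name `contrastive_loss_per_pair` and the 2-D embeddings are hypothetical.

```python
import numpy as np

def contrastive_loss_per_pair(embeddings1, embeddings2, labels, margin=1.0):
    # Same formula as above: labels are 0 for similar pairs, 1 for dissimilar pairs
    distances = np.linalg.norm(embeddings1 - embeddings2, axis=1)
    similar_loss = (1 - labels) * 0.5 * distances**2
    dissimilar_loss = labels * 0.5 * np.maximum(0, margin - distances)**2
    return similar_loss + dissimilar_loss   # not averaged, one value per pair

# Three hypothetical pairs, all with the first embedding at the origin
e1 = np.zeros((3, 2))
e2 = np.array([[0.5, 0.0],    # similar pair, 0.5 apart
               [0.5, 0.0],    # dissimilar pair, 0.5 apart (inside the margin)
               [2.0, 0.0]])   # dissimilar pair, 2.0 apart (beyond the margin)
labels = np.array([0, 1, 1])

per_pair = contrastive_loss_per_pair(e1, e2, labels)
print(per_pair)   # losses of 0.125, 0.125 and 0.0
```

The dissimilar pair that is already beyond the margin contributes nothing, so the model stops spending effort on pairs that are far enough apart.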

Triplet Loss

For learning embeddings with anchor-positive-negative triplets:

$$\mathcal{L}_{\text{triplet}} = \max(0, d(a, p) - d(a, n) + m)$$
def triplet_loss(anchor, positive, negative, margin=1.0):
    """
    Triplet loss for embedding learning.
    
    Args:
        anchor: Anchor embeddings
        positive: Positive (similar) embeddings
        negative: Negative (dissimilar) embeddings
        margin: Margin between positive and negative
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    
    return np.mean(np.maximum(0, d_pos - d_neg + margin))
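As a usage sketch, the example below evaluates the same formula on two hypothetical triplets: one where the negative is already far enough away, and one where it violates the margin. The name `triplet_loss_per_sample` and the embeddings are assumptions for illustration.

```python
import numpy as np

def triplet_loss_per_sample(anchor, positive, negative, margin=1.0):
    # Same formula as above, returned per triplet instead of averaged
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(0, d_pos - d_neg + margin)

anchor = np.zeros((2, 2))
positive = np.array([[1.0, 0.0],
                     [1.0, 0.0]])
negative = np.array([[3.0, 0.0],    # easy: 1 - 3 + 1 = -1, clipped to 0
                     [1.5, 0.0]])   # hard: 1 - 1.5 + 1 = 0.5

triplet_losses = triplet_loss_per_sample(anchor, positive, negative)
print(triplet_losses)               # losses of 0.0 and 0.5
```

Only the second triplet produces a gradient, which is why triplet training usually depends on finding triplets that still violate the margin.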

Loss Functions in PyTorch

import torch
import torch.nn as nn

# Regression losses
mse = nn.MSELoss()
mae = nn.L1Loss()
huber = nn.SmoothL1Loss()   # Equals Huber loss when beta = delta = 1; see nn.HuberLoss for other deltas

# Classification losses
bce = nn.BCELoss()          # Requires sigmoid output
bce_logits = nn.BCEWithLogitsLoss()  # Raw logits (more stable)
ce = nn.CrossEntropyLoss()  # Includes softmax (raw logits)
nll = nn.NLLLoss()          # Requires log-softmax output

# Example usage
logits = torch.randn(10, 5)  # 10 samples, 5 classes
targets = torch.randint(0, 5, (10,))  # Class indices

loss = nn.CrossEntropyLoss()(logits, targets)
print(f"Cross-Entropy Loss: {loss.item():.4f}")

# With label smoothing (PyTorch 1.10+)
ce_smooth = nn.CrossEntropyLoss(label_smoothing=0.1)
loss_smooth = ce_smooth(logits, targets)
print(f"With Label Smoothing: {loss_smooth.item():.4f}")

Custom Loss Functions

class FocalLoss(nn.Module):
    """Focal Loss for class imbalance."""
    
    def __init__(self, gamma=2.0, alpha=None, reduction='mean'):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha
        self.reduction = reduction
    
    def forward(self, inputs, targets):
        ce_loss = nn.functional.cross_entropy(
            inputs, targets, reduction='none'
        )
        pt = torch.exp(-ce_loss)
        focal_loss = (1 - pt) ** self.gamma * ce_loss
        
        if self.alpha is not None:
            alpha_t = self.alpha[targets]
            focal_loss = alpha_t * focal_loss
        
        if self.reduction == 'mean':
            return focal_loss.mean()
        elif self.reduction == 'sum':
            return focal_loss.sum()
        return focal_loss

# Usage
criterion = FocalLoss(gamma=2.0)
loss = criterion(logits, targets)

Choosing the Right Loss Function

Decision Guide

| Task | Loss Function | Output Activation |
| --- | --- | --- |
| Regression | MSE | Linear |
| Regression with outliers | Huber / MAE | Linear |
| Binary classification | BCE | Sigmoid |
| Multi-class classification | Cross-Entropy | Softmax (or none with CE) |
| Multi-label classification | BCE | Sigmoid |
| Imbalanced classification | Focal Loss | Softmax |
| Similarity learning | Contrastive / Triplet | Normalized embeddings |
| Object detection | Focal + Smooth L1 | Various |
| Image generation | Perceptual + L1 | Tanh |

Common Mistakes

| Mistake | Problem | Solution |
| --- | --- | --- |
| Using MSE for classification | Gradients vanish when sigmoid is saturated, so the model learns slowly from its worst predictions | Use cross-entropy, which gives strong gradients precisely when the model is confident and wrong |
| Forgetting epsilon in log | NaN/Inf values that crash training | Clip predictions: y_pred = np.clip(y_pred, 1e-7, 1-1e-7) or use logits-based loss |
| Wrong activation + loss combo | Mismatched inputs that distort the loss or cause numerical instability (sigmoid output + NLLLoss, or manual softmax + CrossEntropyLoss, which applies softmax twice) | Pass raw logits to BCEWithLogitsLoss (includes sigmoid) or CrossEntropyLoss (includes softmax) |
| Ignoring class imbalance | Model predicts majority class for everything, achieves high accuracy but fails on the class you care about | Focal loss, class-weighted loss, or oversampling the minority class |

Debugging hint: If your loss suddenly goes to NaN during training, the most common culprits are: (1) log of zero or negative number, (2) division by zero in normalization, (3) learning rate too high causing weight explosion. Check your loss function’s edge cases first — add epsilon clipping to any log or division operation.
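The "manual softmax + CrossEntropyLoss" mistake from the table is easy to miss because nothing crashes. The NumPy sketch below mimics it by applying softmax twice. It illustrates the effect under assumed logits and is not PyTorch's internals.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

logits = np.array([5.0, 0.0, 0.0])   # confident, correct prediction for class 0

probs_once = softmax(logits)          # what a logits-based loss computes internally
probs_twice = softmax(probs_once)     # what happens if you also apply softmax yourself

loss_once = -np.log(probs_once[0])
loss_twice = -np.log(probs_twice[0])

print(f"softmax once:  p = {probs_once[0]:.4f}, loss = {loss_once:.4f}")
print(f"softmax twice: p = {probs_twice[0]:.4f}, loss = {loss_twice:.4f}")
```

Because the inputs to the second softmax all lie in [0, 1], the largest probability it can produce with three classes is e / (e + 2) ≈ 0.576. The loss can therefore never approach zero, however good the model is.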

Exercises

Implement these losses from scratch and compare their gradients:
  1. Hinge loss: $\max(0, 1 - y \cdot \hat{y})$
  2. Exponential loss: $e^{-y \cdot \hat{y}}$
  3. Logistic loss: $\log(1 + e^{-y \cdot \hat{y}})$
Create a loss function for a model that simultaneously:
  1. Classifies images (cross-entropy)
  2. Predicts bounding boxes (smooth L1)
  3. Estimates uncertainty (KL divergence)
How do you weight the different losses?
Train a small network on a 2D classification task. Visualize the loss landscape:
  1. Along the line connecting initial and final weights
  2. In a random 2D plane around the minimum
What do you observe about the landscape shape?
On an imbalanced dataset (1:10 ratio):
  1. Train with BCE, Weighted BCE, and Focal Loss
  2. Tune the gamma parameter in Focal Loss
  3. Plot precision-recall curves for each
Which loss gives the best F1 score on the minority class?

Key Takeaways

| Concept | Key Insight |
| --- | --- |
| Loss = Objective | What we optimize is what we get |
| MSE vs MAE | MSE penalizes outliers more |
| Cross-Entropy | Information-theoretic, great for classification |
| Focal Loss | Handles class imbalance |
| Contrastive/Triplet | For learning embeddings |
| Label Smoothing | Prevents overconfidence |

What’s Next

We’ve covered the foundations! Now let’s move to powerful architectures:

Module 6: Convolutional Neural Networks

The architecture that revolutionized computer vision — convolutions, filters, and feature maps.

Interview Deep-Dive

Strong Answer:
  • Cross-entropy is the negative log-likelihood of the data under the model’s predicted distribution. For a classification model that outputs probabilities via softmax, minimizing cross-entropy is mathematically equivalent to maximizing the likelihood of the correct labels — it is the maximum likelihood estimator for categorical data.
  • MSE for classification has two fundamental problems: (1) the gradient $\partial L / \partial z$ for MSE with sigmoid output is $(\hat{y} - y) \cdot \sigma'(z)$ (up to a constant factor), which includes the sigmoid derivative $\sigma'(z)$. When the model makes a confident wrong prediction (sigmoid saturated), $\sigma'(z) \approx 0$, so the gradient vanishes precisely when it should be largest. Cross-entropy’s gradient is simply $\hat{y} - y$, with no sigmoid derivative factor, so confident wrong predictions produce the largest gradients.
  • (2) MSE’s loss landscape for classification has many flat regions (where sigmoid is saturated) and is non-convex with respect to the logits. Cross-entropy’s landscape is convex with respect to the logits (and therefore with respect to the weights of a linear model), leading to smoother optimization.
  • Probabilistic interpretation: cross-entropy measures the KL divergence (plus a constant) between the true label distribution and the predicted distribution. It directly quantifies how surprised the model is by the true labels. MSE has no such information-theoretic interpretation for categorical data.
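The vanishing-gradient argument can be checked with a single saturated example. This sketch assumes the 1/2 convention for MSE so that the gradient matches the expression in the answer exactly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0      # true label
z = -10.0    # logit of a confident, wrong prediction
y_hat = sigmoid(z)

# MSE with L = 0.5 * (y_hat - y)^2, so dL/dz = (y_hat - y) * sigma'(z)
grad_mse = (y_hat - y) * y_hat * (1.0 - y_hat)

# BCE: the sigma'(z) factor cancels, leaving dL/dz = y_hat - y
grad_bce = y_hat - y

print(f"y_hat    = {y_hat:.6f}")
print(f"MSE grad = {grad_mse:.2e}")
print(f"BCE grad = {grad_bce:.4f}")
```

The prediction is almost maximally wrong, yet the MSE gradient is on the order of 1e-5 while the BCE gradient is close to -1.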
Follow-up: When might you legitimately use MSE for classification?
In knowledge distillation, where the “labels” are soft probability distributions from a teacher model rather than hard 0/1 labels. MSE between the student’s logits and the teacher’s logits can work well because both are continuous-valued and the sigmoid saturation problem is less severe when targets are not at the extremes. Some mean-teacher methods use MSE loss on the consistency between augmented views. Also, in multi-label classification with many labels, MSE on the probability vectors can be more stable than BCE when the label space is very large and sparse.
Strong Answer:
  • Focal loss was introduced in the RetinaNet paper (Lin et al., 2017) to address the extreme class imbalance in object detection. In a typical image, there might be 100,000 background anchor boxes and only 10 foreground objects. Standard cross-entropy treats all examples equally, so the model is overwhelmed by the massive number of easy negatives.
  • The formula: $FL(p_t) = -(1 - p_t)^\gamma \log(p_t)$, where $p_t$ is the predicted probability for the true class. The key innovation is the modulating factor $(1 - p_t)^\gamma$.
  • When the model predicts correctly with high confidence ($p_t \approx 0.99$), the factor with $\gamma = 2$ is $(1 - 0.99)^2 = 0.0001$, which essentially eliminates this example’s contribution to the loss. The model no longer wastes gradient budget on examples it already classifies easily.
  • When the model is wrong ($p_t \approx 0.1$), the factor $(1 - 0.1)^2 = 0.81$ keeps the loss nearly at its full value. Hard examples dominate the training signal.
  • The gamma parameter controls the degree of focus. At $\gamma = 0$, focal loss reduces to standard cross-entropy. At $\gamma = 2$ (the most common setting), easy examples are down-weighted by $100\times$ or more. At $\gamma = 5$, the focusing is extreme and can make training unstable because too few examples contribute meaningful gradients.
  • In practice, $\gamma = 2$ combined with $\alpha$ class weighting is the standard recipe for detection and any heavily imbalanced classification problem.
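The arithmetic in the bullets above can be tabulated directly. This sketch computes only the modulating factor $(1 - p_t)^\gamma$ for a few illustrative probabilities and the three gamma values mentioned.

```python
import numpy as np

p_t = np.array([0.1, 0.5, 0.9, 0.99])   # probability assigned to the true class

factors = {}
for gamma in (0.0, 2.0, 5.0):
    # gamma = 0 gives a factor of 1 everywhere, i.e. plain cross-entropy
    factors[gamma] = (1.0 - p_t) ** gamma
    row = ", ".join(f"{w:.2e}" for w in factors[gamma])
    print(f"gamma = {gamma}: {row}")
```

At gamma = 2 the factor falls from 0.81 for the hard example (p_t = 0.1) to 0.0001 for the easy one (p_t = 0.99). At gamma = 5 even the p_t = 0.5 example keeps only about 3% of its weight.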
Follow-up: What is the relationship between focal loss and hard example mining? Why is focal loss generally preferred?
Hard example mining (OHEM) explicitly selects the top-k hardest examples per batch and only computes loss on those. Focal loss achieves a similar effect implicitly by continuously reweighting all examples based on difficulty. Focal loss is preferred because: (1) it is differentiable and integrates smoothly into standard training loops, (2) it considers all examples rather than discarding easy ones entirely, which provides a small but non-zero learning signal from easy examples, and (3) it does not require the additional sorting step that makes OHEM 20-30% slower. The continuous reweighting also adapts dynamically as the model improves: an example that was hard at epoch 5 may become easy by epoch 50 and automatically receive less weight.
Strong Answer:
  • Multi-task learning requires a combined loss: $L_{total} = \lambda_1 L_{cls} + \lambda_2 L_{box}$, where $L_{cls}$ is cross-entropy for classification and $L_{box}$ is smooth L1 (Huber loss) for bounding box regression. The challenge is balancing these losses so neither dominates.
  • Why smooth L1 for bounding boxes: smooth L1 is quadratic for small errors (so gradients shrink smoothly near the optimum and the fit can settle precisely) and linear for large errors (preventing outlier boxes from dominating training). Pure MSE would cause one badly predicted box to overwhelm the gradients for all well-predicted boxes.
  • Balancing loss terms: untuned equal weights ($\lambda_1 = 1, \lambda_2 = 1$) often fail because the loss scales and gradient magnitudes differ. Three principled approaches:
    • Manual tuning: start with equal weights, observe which task dominates (monitor individual loss curves), and adjust. Simple but tedious.
    • Uncertainty-based weighting (Kendall et al., 2018): learn the weight for each task as $\lambda_i = 1/(2\sigma_i^2)$ where $\sigma_i$ is a learned task uncertainty. Tasks with higher uncertainty receive lower weight. This is mathematically principled (derived from multi-task likelihood).
    • GradNorm: normalize the gradient norms from each loss so they are approximately equal, preventing one task from dominating the shared representations.
  • In modern object detectors (YOLO, Faster R-CNN), the standard approach uses fixed weights that are tuned on the validation set, typically with the classification loss weighted lower than the box regression loss because classification is easier to learn.
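The uncertainty-based weighting above can be sketched as follows. This is a minimal illustration and not a reference implementation. It uses the $1/(2\sigma_i^2)$ weights from the bullet and assumes the usual $\log \sigma_i$ regularizer that accompanies them. The function name, the log-variance parameterization, and the loss values are hypothetical.

```python
import numpy as np

def uncertainty_weighted_loss(task_losses, log_variances):
    """
    Combine per-task losses with weights 1 / (2 * sigma_i^2).

    log_variances holds s_i = log(sigma_i^2). In a real model these would be
    learnable parameters. The 0.5 * s_i term (equal to log sigma_i) is an assumed
    regularizer that stops the optimizer from inflating every sigma_i to
    drive the weighted losses to zero.
    """
    task_losses = np.asarray(task_losses, dtype=float)
    log_variances = np.asarray(log_variances, dtype=float)
    weights = 0.5 * np.exp(-log_variances)        # 1 / (2 sigma_i^2)
    return np.sum(weights * task_losses + 0.5 * log_variances)

l_cls, l_box = 2.0, 4.0   # illustrative classification and box losses

# sigma_1 = sigma_2 = 1: both tasks weighted by 0.5
equal = uncertainty_weighted_loss([l_cls, l_box], [0.0, 0.0])

# sigma_2^2 = 4: the box loss weight drops to 0.125, at a cost of 0.5 * log(4)
down_weighted = uncertainty_weighted_loss([l_cls, l_box], [0.0, np.log(4.0)])

print(f"equal uncertainty:      {equal:.4f}")
print(f"box task down-weighted: {down_weighted:.4f}")
```

Parameterizing by the log-variance rather than by sigma keeps the weight positive without constraints, which is a common design choice when these values are trained by gradient descent.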
Follow-up: What happens if the two losses conflict, so that optimizing one hurts the other?
This is called negative transfer, and it occurs when the optimal shared representations for one task are suboptimal for the other. For example, fine-grained classification benefits from texture features, while bounding box regression benefits from shape features. Signs: total loss decreases but one individual loss increases. Fixes: (1) use task-specific heads with a shared backbone but stop gradients from one task flowing into the other’s head, (2) use a multi-gate mixture-of-experts architecture where each task selects different experts from the shared backbone, or (3) simply train separate models if the tasks are truly at odds. In practice, classification and box regression are well-aligned and negative transfer is rare for this specific combination.
Strong Answer:
  • Label smoothing replaces hard targets $[1, 0, 0, 0]$ with soft targets $[0.925, 0.025, 0.025, 0.025]$ (for $\epsilon = 0.1$ and $K = 4$ classes). Instead of driving the model to produce infinite logits for the correct class, it encourages the model to produce high but finite confidence.
  • Why it improves generalization: (1) It prevents the model from becoming overconfident, which is a form of overfitting to the training labels. A model trained with hard labels can learn to output logits of magnitude 20+, which means tiny input perturbations can flip predictions. Label smoothing keeps logits moderate, producing more robust decision boundaries. (2) It implicitly regularizes by adding a uniform distribution component to the target, which is equivalent to adding a small KL-divergence penalty toward the uniform distribution. (3) It improves calibration — the predicted probabilities more accurately reflect the true uncertainty.
  • When it can hurt: (1) In knowledge distillation, where the student should learn to match the teacher’s sharp predictions exactly. Smoothing the teacher’s targets reduces the information content. (2) When labels are genuinely certain and the data is clean — e.g., mathematical theorem proving where there is exactly one correct answer. (3) In metric learning or contrastive learning where you need the model to distinguish between semantically similar classes with high precision.
  • Standard practice: use $\epsilon = 0.1$ for both image classification and machine translation. It is essentially free to implement (one line in PyTorch: nn.CrossEntropyLoss(label_smoothing=0.1)) and is commonly reported to give a small accuracy improvement (roughly 0.2-0.5%) on many benchmarks.
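The "high but finite confidence" claim can be made concrete. Cross-entropy against a fixed target distribution is minimized when the softmax output equals that target, so the optimal logits under label smoothing are finite. The sketch below uses the $K = 4$, $\epsilon = 0.1$ example from the answer.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

K, epsilon = 4, 0.1
smooth_target = np.full(K, epsilon / K)
smooth_target[0] += 1.0 - epsilon            # [0.925, 0.025, 0.025, 0.025]

# The loss is minimised when softmax(logits) == target, so the optimal
# logits are log(target) up to an additive constant.
optimal_logits = np.log(smooth_target)
logit_gap = optimal_logits[0] - optimal_logits[1]   # log(0.925 / 0.025) = log(37)

print(f"optimal logit gap: {logit_gap:.3f}")
print(f"softmax at optimum: {softmax(optimal_logits)}")
```

With hard labels the loss keeps decreasing as the gap grows without bound. With smoothing, the best achievable gap here is log(37), about 3.6, and pushing beyond it increases the loss.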
Follow-up: Label smoothing assigns equal probability to all incorrect classes. Is that always the right assumption?
No, and this is a limitation. A cat image should have some probability mass on “dog” and “tiger” but very little on “airplane.” Knowledge distillation implicitly provides this structure, because the teacher’s soft predictions encode inter-class similarity. Some approaches use structured label smoothing based on a class hierarchy or embedding distance, assigning more probability to semantically similar incorrect classes. In practice, the uniform assumption works surprisingly well because the model learns to ignore the uninformative uniform noise and focus on the correct class signal. The benefit comes primarily from preventing infinite-logit overconfidence, not from the specific distribution over incorrect classes.