
Loss Functions & Objectives

The Central Role of Loss Functions

A neural network learns by minimizing a loss function (also called objective function, cost function, or criterion). The loss function answers: “How wrong is my prediction?”

$$\text{Training} = \underset{\theta}{\text{minimize}} \; \mathcal{L}(f_\theta(X), Y)$$

Where:
  • $\theta$ = model parameters (weights and biases)
  • $f_\theta(X)$ = model predictions
  • $Y$ = true labels
  • $\mathcal{L}$ = loss function
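As a minimal sketch of this minimization loop (assuming, purely for illustration, a one-parameter linear model $f_w(x) = w \cdot x$ trained with plain gradient descent on an MSE loss; the data and learning rate below are made up):
import numpy as np

# Hypothetical data: y is roughly 2 * x
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([2.1, 3.9, 6.2, 8.1])

w = 0.0                                  # theta: the single model parameter
for step in range(100):
    y_hat = w * X                        # f_theta(X): model predictions
    grad = np.mean(2 * (y_hat - Y) * X)  # dL/dw for the MSE loss
    w -= 0.01 * grad                     # gradient descent update

print(f"Learned w: {w:.3f}")             # converges close to 2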
Design Choice: Choosing the right loss function depends on:
  1. The type of problem (regression, classification, ranking)
  2. The output distribution you expect
  3. What kind of errors you care about most

Regression Loss Functions

Mean Squared Error (MSE)

$$\mathcal{L}_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
import numpy as np
import matplotlib.pyplot as plt

def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)

def mse_gradient(y_true, y_pred):
    """Gradient w.r.t. y_pred"""
    return 2 * (y_pred - y_true) / len(y_true)

# Example
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 2.2, 2.8, 4.1])

print(f"MSE Loss: {mse_loss(y_true, y_pred):.4f}")
Properties:
| Property | Value |
| --- | --- |
| Range | [0, ∞) |
| Optimal | When y_pred = y_true |
| Gradient | Linear in error |
| Outlier sensitivity | HIGH (squared errors) |
When to use:
  • Regression with Gaussian noise assumption
  • When you want to penalize large errors heavily

Mean Absolute Error (MAE / L1 Loss)

$$\mathcal{L}_{\text{MAE}} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$
def mae_loss(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mae_gradient(y_true, y_pred):
    return np.sign(y_pred - y_true) / len(y_true)

print(f"MAE Loss: {mae_loss(y_true, y_pred):.4f}")
Properties:
| Property | Value |
| --- | --- |
| Range | [0, ∞) |
| Gradient | Constant magnitude |
| Outlier sensitivity | LOW |
| Problem | Not differentiable at 0 |
When to use:
  • When outliers are expected
  • When you care about median prediction
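The “median” point can be checked directly: for a constant prediction, MSE is minimized at the mean of the targets while MAE is minimized at their median. A small sketch (illustrative targets, brute-force search over candidate constants):
# One large target pulls the mean (MSE minimizer) but not the median (MAE minimizer)
targets = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
candidates = np.linspace(0, 110, 1101)

best_mse = candidates[np.argmin([mse_loss(targets, np.full_like(targets, c)) for c in candidates])]
best_mae = candidates[np.argmin([mae_loss(targets, np.full_like(targets, c)) for c in candidates])]

print(f"Constant minimizing MSE: {best_mse:.1f} (mean   = {targets.mean():.1f})")
print(f"Constant minimizing MAE: {best_mae:.1f} (median = {np.median(targets):.1f})")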

Huber Loss (Smooth L1)

Combines the best of MSE and MAE:

$$\mathcal{L}_{\text{Huber}} = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\ \delta|y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$
def huber_loss(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    is_small = np.abs(error) <= delta
    squared_loss = 0.5 * error**2
    linear_loss = delta * np.abs(error) - 0.5 * delta**2
    return np.mean(np.where(is_small, squared_loss, linear_loss))
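To see the outlier behavior numerically, here is a quick illustrative comparison (made-up targets with one corrupted value):
# One corrupted target (100.0) dominates MSE, barely moves MAE; Huber sits in between
y_true_out = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
y_pred_out = np.array([1.1, 2.2, 2.8, 4.1, 5.0])

print(f"MSE:   {mse_loss(y_true_out, y_pred_out):.2f}")    # ~1805: dominated by the outlier
print(f"MAE:   {mae_loss(y_true_out, y_pred_out):.2f}")    # ~19.1
print(f"Huber: {huber_loss(y_true_out, y_pred_out):.2f}")  # ~18.9: linear beyond delta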
[Figure: Regression Loss Functions]

Classification Loss Functions

Binary Cross-Entropy (BCE)

For binary classification with output $\hat{y} \in (0, 1)$:

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]$$
def binary_cross_entropy(y_true, y_pred, epsilon=1e-15):
    """
    Binary cross-entropy loss.
    
    Args:
        y_true: True labels (0 or 1)
        y_pred: Predicted probabilities (0 to 1)
    """
    # Clip to avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    
    loss = -np.mean(
        y_true * np.log(y_pred) + 
        (1 - y_true) * np.log(1 - y_pred)
    )
    return loss

def bce_gradient(y_true, y_pred, epsilon=1e-15):
    """Gradient w.r.t. y_pred"""
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return (y_pred - y_true) / (y_pred * (1 - y_pred)) / len(y_true)

# Example
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0.1, 0.2, 0.8, 0.9])

print(f"BCE Loss: {binary_cross_entropy(y_true, y_pred):.4f}")
Intuition: BCE measures the “information” needed to correct the prediction.
  • Confident and correct: Low loss
  • Confident and wrong: HIGH loss
  • Uncertain: Medium loss
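A quick per-example check of this intuition (a single positive example, $y = 1$, at a few illustrative confidence levels):
# Per-example BCE for a positive label at different predicted probabilities
for p in [0.99, 0.6, 0.01]:
    print(f"y=1, y_hat={p:.2f}: BCE = {-np.log(p):.4f}")
# 0.99 (confident, correct) -> ~0.01
# 0.60 (uncertain)          -> ~0.51
# 0.01 (confident, wrong)   -> ~4.61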

Categorical Cross-Entropy

For multi-class classification with $K$ classes:

$$\mathcal{L}_{\text{CE}} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}y_{ik}\log(\hat{y}_{ik})$$
def categorical_cross_entropy(y_true, y_pred, epsilon=1e-15):
    """
    Categorical cross-entropy loss.
    
    Args:
        y_true: One-hot encoded labels (n, K)
        y_pred: Predicted probabilities (n, K)
    """
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

def softmax(z):
    """Stable softmax"""
    exp_z = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Example: 3 samples, 4 classes
y_true = np.array([
    [1, 0, 0, 0],  # Class 0
    [0, 1, 0, 0],  # Class 1
    [0, 0, 0, 1],  # Class 3
])

logits = np.array([
    [2.0, 0.5, 0.1, 0.1],
    [0.1, 1.5, 0.3, 0.2],
    [0.2, 0.1, 0.3, 1.8],
])

y_pred = softmax(logits)
print(f"Predictions:\n{y_pred}")
print(f"CE Loss: {categorical_cross_entropy(y_true, y_pred):.4f}")

Cross-Entropy with Logits

In practice, we use the numerically stable version that combines softmax + cross-entropy:
def cross_entropy_with_logits(y_true, logits):
    """
    Numerically stable cross-entropy from logits.
    Uses the log-sum-exp trick.
    """
    # log-softmax = (logits - max) - log(sum(exp(logits - max)))
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))

    # y_true can be one-hot or class indices
    if y_true.ndim == 1:  # Class indices
        n = len(y_true)
        return -np.mean(log_softmax[np.arange(n), y_true])
    else:  # One-hot labels
        return -np.mean(np.sum(y_true * log_softmax, axis=1))
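As a sanity check (reusing y_true and logits from the categorical example above), both label formats should reproduce the softmax-then-cross-entropy value:
# Should match categorical_cross_entropy(y_true, softmax(logits)) up to clipping
print(f"From one-hot labels:  {cross_entropy_with_logits(y_true, logits):.4f}")
print(f"From class indices:   {cross_entropy_with_logits(np.array([0, 1, 3]), logits):.4f}")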

Visualizing Loss Landscapes

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def plot_loss_landscape():
    """Visualize loss as a function of two parameters."""
    # Simple quadratic loss landscape
    w1 = np.linspace(-3, 3, 100)
    w2 = np.linspace(-3, 3, 100)
    W1, W2 = np.meshgrid(w1, w2)
    
    # Loss function: (w1 - 1)^2 + 2*(w2 - 0.5)^2
    L = (W1 - 1)**2 + 2*(W2 - 0.5)**2
    
    fig = plt.figure(figsize=(14, 5))
    
    # 3D surface
    ax1 = fig.add_subplot(121, projection='3d')
    ax1.plot_surface(W1, W2, L, cmap='viridis', alpha=0.8)
    ax1.set_xlabel('w₁')
    ax1.set_ylabel('w₂')
    ax1.set_zlabel('Loss')
    ax1.set_title('Loss Landscape (3D)')
    
    # Contour plot
    ax2 = fig.add_subplot(122)
    contour = ax2.contour(W1, W2, L, levels=20, cmap='viridis')
    ax2.clabel(contour, inline=True, fontsize=8)
    ax2.plot(1, 0.5, 'r*', markersize=15, label='Minimum')
    ax2.set_xlabel('w₁')
    ax2.set_ylabel('w₂')
    ax2.set_title('Loss Landscape (Contour)')
    ax2.legend()
    
    plt.tight_layout()
    plt.show()

plot_loss_landscape()
[Figure: Loss Landscape Visualization]

Advanced Loss Functions

Focal Loss

Addresses class imbalance by down-weighting easy examples:

$$\mathcal{L}_{\text{focal}} = -\alpha_t (1 - p_t)^\gamma \log(p_t)$$
def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25):
    """
    Focal loss for handling class imbalance.
    
    Args:
        gamma: Focusing parameter (higher = more focus on hard examples)
        alpha: Class weight
    """
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
    
    # Compute pt (probability of true class)
    pt = np.where(y_true == 1, y_pred, 1 - y_pred)
    
    # Focal weight
    focal_weight = (1 - pt) ** gamma
    
    # Alpha weight
    alpha_weight = np.where(y_true == 1, alpha, 1 - alpha)
    
    # Cross-entropy
    ce = -np.log(pt)
    
    return np.mean(alpha_weight * focal_weight * ce)

# Compare BCE vs Focal Loss on imbalanced data
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])  # 10% positive class
y_pred = np.array([0.1, 0.2, 0.1, 0.15, 0.05, 0.1, 0.2, 0.1, 0.15, 0.7])

print(f"BCE Loss: {binary_cross_entropy(y_true, y_pred):.4f}")
print(f"Focal Loss: {focal_loss(y_true, y_pred):.4f}")

Label Smoothing

Prevents overconfidence by softening targets:

$$y'_k = \begin{cases} 1 - \epsilon + \frac{\epsilon}{K} & \text{if } k = \text{true class} \\ \frac{\epsilon}{K} & \text{otherwise} \end{cases}$$
def label_smoothing(y_true, n_classes, epsilon=0.1):
    """
    Apply label smoothing to one-hot labels.
    """
    return y_true * (1 - epsilon) + epsilon / n_classes

# Example
y_hard = np.array([1, 0, 0, 0])  # Hard label
y_smooth = label_smoothing(y_hard, n_classes=4, epsilon=0.1)
print(f"Hard label: {y_hard}")
print(f"Smooth label: {y_smooth}")

Contrastive Loss

For learning embeddings where similar items are close:

$$\mathcal{L}_{\text{contrastive}} = (1-y) \cdot \frac{1}{2}d^2 + y \cdot \frac{1}{2}\max(0, m - d)^2$$
def contrastive_loss(embeddings1, embeddings2, labels, margin=1.0):
    """
    Contrastive loss for similarity learning.
    
    Args:
        embeddings1, embeddings2: Pair of embeddings
        labels: 0 if similar, 1 if dissimilar
        margin: Margin for dissimilar pairs
    """
    distances = np.linalg.norm(embeddings1 - embeddings2, axis=1)
    
    # Similar pairs: minimize distance
    similar_loss = (1 - labels) * 0.5 * distances**2
    
    # Dissimilar pairs: push apart beyond margin
    dissimilar_loss = labels * 0.5 * np.maximum(0, margin - distances)**2
    
    return np.mean(similar_loss + dissimilar_loss)
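A small usage sketch with made-up 2-D embeddings (first pair similar, label 0; second pair dissimilar, label 1):
emb_a = np.array([[0.0, 0.0], [0.0, 0.0]])
emb_b = np.array([[0.1, 0.1], [0.3, 0.4]])
labels = np.array([0, 1])  # 0 = similar pair, 1 = dissimilar pair

print(f"Contrastive Loss: {contrastive_loss(emb_a, emb_b, labels):.4f}")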

Triplet Loss

For learning embeddings with anchor-positive-negative triplets:

$$\mathcal{L}_{\text{triplet}} = \max(0, d(a, p) - d(a, n) + m)$$
def triplet_loss(anchor, positive, negative, margin=1.0):
    """
    Triplet loss for embedding learning.
    
    Args:
        anchor: Anchor embeddings
        positive: Positive (similar) embeddings
        negative: Negative (dissimilar) embeddings
        margin: Margin between positive and negative
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    
    return np.mean(np.maximum(0, d_pos - d_neg + margin))
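And a matching sketch for the triplet case (one made-up triplet where the positive is already much closer to the anchor than the negative):
anchor   = np.array([[0.0, 0.0]])
positive = np.array([[0.1, 0.0]])
negative = np.array([[1.5, 0.0]])

# d_pos=0.1, d_neg=1.5 -> max(0, 0.1 - 1.5 + 1.0) = 0: the margin is already satisfied
print(f"Triplet Loss: {triplet_loss(anchor, positive, negative):.4f}")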

Loss Functions in PyTorch

import torch
import torch.nn as nn

# Regression losses
mse = nn.MSELoss()
mae = nn.L1Loss()
huber = nn.SmoothL1Loss()

# Classification losses
bce = nn.BCELoss()          # Requires sigmoid output
bce_logits = nn.BCEWithLogitsLoss()  # Raw logits (more stable)
ce = nn.CrossEntropyLoss()  # Includes softmax (raw logits)
nll = nn.NLLLoss()          # Requires log-softmax output

# Example usage
logits = torch.randn(10, 5)  # 10 samples, 5 classes
targets = torch.randint(0, 5, (10,))  # Class indices

loss = nn.CrossEntropyLoss()(logits, targets)
print(f"Cross-Entropy Loss: {loss.item():.4f}")

# With label smoothing (PyTorch 1.10+)
ce_smooth = nn.CrossEntropyLoss(label_smoothing=0.1)
loss_smooth = ce_smooth(logits, targets)
print(f"With Label Smoothing: {loss_smooth.item():.4f}")

Custom Loss Functions

class FocalLoss(nn.Module):
    """Focal Loss for class imbalance."""
    
    def __init__(self, gamma=2.0, alpha=None, reduction='mean'):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha
        self.reduction = reduction
    
    def forward(self, inputs, targets):
        ce_loss = nn.functional.cross_entropy(
            inputs, targets, reduction='none'
        )
        pt = torch.exp(-ce_loss)
        focal_loss = (1 - pt) ** self.gamma * ce_loss
        
        if self.alpha is not None:
            alpha_t = self.alpha[targets]
            focal_loss = alpha_t * focal_loss
        
        if self.reduction == 'mean':
            return focal_loss.mean()
        elif self.reduction == 'sum':
            return focal_loss.sum()
        return focal_loss

# Usage
criterion = FocalLoss(gamma=2.0)
loss = criterion(logits, targets)

Choosing the Right Loss Function

Decision Guide

| Task | Loss Function | Output Activation |
| --- | --- | --- |
| Regression | MSE | Linear |
| Regression with outliers | Huber / MAE | Linear |
| Binary classification | BCE | Sigmoid |
| Multi-class classification | Cross-Entropy | Softmax (or none with CE) |
| Multi-label classification | BCE | Sigmoid |
| Imbalanced classification | Focal Loss | Softmax |
| Similarity learning | Contrastive / Triplet | Normalized embeddings |
| Object detection | Focal + Smooth L1 | Various |
| Image generation | Perceptual + L1 | Tanh |

Common Mistakes

| Mistake | Problem | Solution |
| --- | --- | --- |
| Using MSE for classification | Weak gradient signal, slow convergence | Use cross-entropy |
| Forgetting epsilon in log | NaN/Inf values | Clip predictions |
| Wrong activation + loss combo | Numerical instability | Use the WithLogits versions (e.g., BCEWithLogitsLoss) |
| Ignoring class imbalance | Poor minority-class performance | Focal loss, weighted loss |

Exercises

Implement these losses from scratch and compare their gradients:
  1. Hinge loss: $\max(0, 1 - y \cdot \hat{y})$
  2. Exponential loss: $e^{-y \cdot \hat{y}}$
  3. Logistic loss: $\log(1 + e^{-y \cdot \hat{y}})$
Create a loss function for a model that simultaneously:
  1. Classifies images (cross-entropy)
  2. Predicts bounding boxes (smooth L1)
  3. Estimates uncertainty (KL divergence)
How do you weight the different losses?
Train a small network on a 2D classification task. Visualize the loss landscape:
  1. Along the line connecting initial and final weights
  2. In a random 2D plane around the minimum
What do you observe about the landscape shape?
On an imbalanced dataset (1:10 ratio):
  1. Train with BCE, Weighted BCE, and Focal Loss
  2. Tune the gamma parameter in Focal Loss
  3. Plot precision-recall curves for each
Which loss gives the best F1 score on the minority class?

Key Takeaways

| Concept | Key Insight |
| --- | --- |
| Loss = Objective | What we optimize is what we get |
| MSE vs MAE | MSE penalizes outliers more |
| Cross-Entropy | Information-theoretic, great for classification |
| Focal Loss | Handles class imbalance |
| Contrastive / Triplet | For learning embeddings |
| Label Smoothing | Prevents overconfidence |

What’s Next

We’ve covered the foundations! Now let’s move to powerful architectures: