Loss Functions & Objectives
The Central Role of Loss Functions
Regression Loss Functions
Mean Squared Error (MSE)
Mean Absolute Error (MAE / L1 Loss)
Huber Loss (Smooth L1)
Classification Loss Functions
Binary Cross-Entropy (BCE)
Categorical Cross-Entropy
Cross-Entropy with Logits
Visualizing Loss Landscapes
Advanced Loss Functions
Focal Loss
Label Smoothing
Contrastive Loss
Triplet Loss
Loss Functions in PyTorch
Custom Loss Functions
Choosing the Right Loss Function
Decision Guide
Common Mistakes
Exercises
Key Takeaways
What’s Next

Loss Functions & Objectives

The Central Role of Loss Functions

A neural network learns by minimizing a loss function (also called an objective function, cost function, or criterion). The loss function maps predictions and targets to a single number that answers: “How wrong is my prediction?”

\text{Training} = \underset{\theta}{\text{minimize}} \; \mathcal{L}(f_\theta(X), Y)

Where:

$\theta$ = model parameters (weights and biases)
$f_\theta(X)$ = model predictions
$Y$ = true labels
$\mathcal{L}$ = loss function
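To make the objective concrete, here is a minimal sketch of the minimization loop for a one-feature linear model trained with MSE. The toy data, the `fit_linear` helper, the learning rate, and the step count are all illustrative assumptions, not part of the original material.

```python
import numpy as np

# Hypothetical toy data generated from y = 2x + 1 (no noise).
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = 2.0 * X + 1.0

def fit_linear(X, Y, lr=0.1, steps=500):
    """Minimize the MSE of f(x) = w*x + b with plain gradient descent."""
    w, b = 0.0, 0.0                      # theta = (w, b)
    n = len(X)
    for _ in range(steps):
        error = (w * X + b) - Y          # f_theta(X) - Y
        grad_w = 2.0 * np.sum(error * X) / n
        grad_b = 2.0 * np.sum(error) / n
        w -= lr * grad_w                 # step against the gradient
        b -= lr * grad_b
    return w, b

w, b = fit_linear(X, Y)
final_loss = np.mean((w * X + b - Y) ** 2)
print(f"w={w:.3f}, b={b:.3f}, loss={final_loss:.2e}")
```

The loop recovers parameters close to the generating values (w near 2, b near 1) because the loss is the only signal the parameters ever receive.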

Design Choice: The right loss function depends on:

The type of problem (regression, classification, ranking)
The output distribution you expect
What kind of errors you care about most

Regression Loss Functions

Mean Squared Error (MSE)

\mathcal{L}_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2

import numpy as np
import matplotlib.pyplot as plt

def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)

def mse_gradient(y_true, y_pred):
    """Gradient w.r.t. y_pred"""
    return 2 * (y_pred - y_true) / len(y_true)

# Example
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 2.2, 2.8, 4.1])

print(f"MSE Loss: {mse_loss(y_true, y_pred):.4f}")

Properties:

Property	Value
Range	[0, ∞)
Optimal	When y_pred = y_true
Gradient	Linear in error
Outlier sensitivity	HIGH (squared errors)

When to use:

Regression where the noise is assumed Gaussian (minimizing MSE is then maximum likelihood)
When you want to penalize large errors heavily

Mean Absolute Error (MAE / L1 Loss)

\mathcal{L}_{\text{MAE}} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|

def mae_loss(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mae_gradient(y_true, y_pred):
    return np.sign(y_pred - y_true) / len(y_true)

print(f"MAE Loss: {mae_loss(y_true, y_pred):.4f}")

Properties:

Property	Value
Range	[0, ∞)
Gradient	Constant magnitude
Outlier sensitivity	LOW
Problem	Not differentiable at zero error (np.sign returns 0 there, which is a valid subgradient)

When to use:

When outliers are expected
When you want to predict the conditional median rather than the mean
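The mean-versus-median distinction can be checked numerically. The sketch below searches a grid of constant predictions for the one that minimizes each loss. The data (with a single outlier) and the grid resolution are made-up assumptions for illustration.

```python
import numpy as np

# Hypothetical targets with one outlier (100.0).
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Score every constant prediction c on a grid under both losses.
candidates = np.linspace(0.0, 100.0, 10001)   # grid step of 0.01
mse_per_c = np.mean((y[None, :] - candidates[:, None]) ** 2, axis=1)
mae_per_c = np.mean(np.abs(y[None, :] - candidates[:, None]), axis=1)

best_mse_c = candidates[np.argmin(mse_per_c)]
best_mae_c = candidates[np.argmin(mae_per_c)]

print(f"MSE-optimal constant: {best_mse_c:.2f} (mean   = {np.mean(y):.2f})")
print(f"MAE-optimal constant: {best_mae_c:.2f} (median = {np.median(y):.2f})")
```

The MSE optimum is dragged toward the outlier (it lands on the mean, 22), while the MAE optimum stays at the median, 3.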

Huber Loss (Smooth L1)

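Combines the best of MSE and MAE: quadratic for errors up to $\delta$ (smooth gradient near zero) and linear beyond it (robust to outliers). With $\delta = 1$ it coincides with the Smooth L1 loss: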

\mathcal{L}_{\text{Huber}} = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\ \delta|y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}

def huber_loss(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    is_small = np.abs(error) <= delta
    squared_loss = 0.5 * error**2
    linear_loss = delta * np.abs(error) - 0.5 * delta**2
    return np.mean(np.where(is_small, squared_loss, linear_loss))
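The MSE and MAE sections each give a gradient, so here is a matching sketch for Huber. The `huber_gradient` helper is an illustrative addition. It relies on the fact that the derivative is the error itself inside the quadratic region and $\pm\delta$ outside it, which is exactly a clipped error.

```python
import numpy as np

def huber_gradient(y_true, y_pred, delta=1.0):
    """Gradient of the mean Huber loss w.r.t. y_pred (illustrative sketch)."""
    error = y_pred - y_true
    # Quadratic region: gradient equals the error.
    # Linear region: gradient saturates at +/- delta.
    return np.clip(error, -delta, delta) / len(y_true)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 2.2, 2.8, 14.0])   # last prediction is far off

grad = huber_gradient(y_true, y_pred, delta=1.0)
print(grad)
```

The outlier's error of 10 contributes a gradient capped at `delta / n`, whereas under MSE it would contribute `2 * 10 / n`.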

Classification Loss Functions

Binary Cross-Entropy (BCE)

For binary classification with output $\hat{y} \in (0, 1)$:

\mathcal{L}_{\text{BCE}} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]

def binary_cross_entropy(y_true, y_pred, epsilon=1e-15):
    """
    Binary cross-entropy loss.
    
    Args:
        y_true: True labels (0 or 1)
        y_pred: Predicted probabilities (0 to 1)
    """
    # Clip to avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    
    loss = -np.mean(
        y_true * np.log(y_pred) + 
        (1 - y_true) * np.log(1 - y_pred)
    )
    return loss

def bce_gradient(y_true, y_pred, epsilon=1e-15):
    """Gradient w.r.t. y_pred"""
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return (y_pred - y_true) / (y_pred * (1 - y_pred)) / len(y_true)

# Example
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0.1, 0.2, 0.8, 0.9])

print(f"BCE Loss: {binary_cross_entropy(y_true, y_pred):.4f}")

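Intuition: BCE is the negative log-probability the model assigns to the true label, so the penalty grows without bound as a confident prediction turns out to be wrong.

Confident and correct: Low loss (close to 0)
Confident and wrong: HIGH loss (unbounded as the prediction approaches the wrong extreme)
Uncertain ($\hat{y} = 0.5$): Medium loss ($\log 2 \approx 0.693$)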
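The three cases can be put into numbers with a per-sample version of the loss. The `bce_per_sample` helper and the probabilities chosen here are illustrative assumptions.

```python
import numpy as np

def bce_per_sample(y_true, y_pred, epsilon=1e-15):
    """Binary cross-entropy for each sample, without averaging."""
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# True label is 1 in all three cases.
y_true = np.array([1.0, 1.0, 1.0])
y_pred = np.array([0.99, 0.5, 0.01])   # confident-correct, uncertain, confident-wrong

losses = bce_per_sample(y_true, y_pred)
for p, loss in zip(y_pred, losses):
    print(f"y_pred={p:.2f} -> loss={loss:.4f}")
# Approximately 0.0101, 0.6931, and 4.6052
```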

Categorical Cross-Entropy

For multi-class classification with $K$ classes:

\mathcal{L}_{\text{CE}} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}y_{ik}\log(\hat{y}_{ik})

def categorical_cross_entropy(y_true, y_pred, epsilon=1e-15):
    """
    Categorical cross-entropy loss.
    
    Args:
        y_true: One-hot encoded labels (n, K)
        y_pred: Predicted probabilities (n, K)
    """
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

def softmax(z):
    """Stable softmax"""
    exp_z = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Example: 3 samples, 4 classes
y_true = np.array([
    [1, 0, 0, 0],  # Class 0
    [0, 1, 0, 0],  # Class 1
    [0, 0, 0, 1],  # Class 3
])

logits = np.array([
    [2.0, 0.5, 0.1, 0.1],
    [0.1, 1.5, 0.3, 0.2],
    [0.2, 0.1, 0.3, 1.8],
])

y_pred = softmax(logits)
print(f"Predictions:\n{y_pred}")
print(f"CE Loss: {categorical_cross_entropy(y_true, y_pred):.4f}")
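A useful property of pairing softmax with cross-entropy is that the gradient with respect to the logits simplifies to (predicted probabilities minus one-hot targets) divided by the batch size. The sketch below checks that claim against a finite-difference estimate. The helper names and the step size `h` are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

def ce_from_logits(y_onehot, logits):
    return -np.mean(np.sum(y_onehot * np.log(softmax(logits)), axis=1))

def ce_logit_gradient(y_onehot, logits):
    """Analytic gradient of the mean cross-entropy w.r.t. the logits."""
    return (softmax(logits) - y_onehot) / len(y_onehot)

def numerical_gradient(y_onehot, logits, h=1e-6):
    """Central finite differences, one logit at a time."""
    grad = np.zeros_like(logits)
    for idx in np.ndindex(*logits.shape):
        plus, minus = logits.copy(), logits.copy()
        plus[idx] += h
        minus[idx] -= h
        grad[idx] = (ce_from_logits(y_onehot, plus)
                     - ce_from_logits(y_onehot, minus)) / (2 * h)
    return grad

y_onehot = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 0, 1.0]])
logits = np.array([[2.0, 0.5, 0.1, 0.1],
                   [0.1, 1.5, 0.3, 0.2],
                   [0.2, 0.1, 0.3, 1.8]])

analytic = ce_logit_gradient(y_onehot, logits)
numeric = numerical_gradient(y_onehot, logits)
print("Max abs difference:", np.max(np.abs(analytic - numeric)))
```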

Cross-Entropy with Logits

In practice, we use the numerically stable version that combines softmax + cross-entropy:

def cross_entropy_with_logits(y_true, logits):
    """
    Numerically stable cross-entropy from logits.
    Uses log-sum-exp trick.
    """
    # Shift by the row-wise max so np.exp never overflows
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))

    # y_true can be one-hot or class indices
    if y_true.ndim == 1:  # Class indices
        n = len(y_true)
        return -np.mean(log_softmax[np.arange(n), y_true])
    else:  # One-hot
        return -np.mean(np.sum(y_true * log_softmax, axis=1))
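To see why the logits version matters, the sketch below compares a naive softmax-then-log computation with a log-sum-exp version on deliberately extreme logits. The `naive_ce` and `stable_ce` helpers and the logit values are illustrative assumptions. NumPy warnings are silenced inside the naive function so the failure shows up as a `nan` result rather than console noise.

```python
import numpy as np

def naive_ce(y_idx, logits):
    """Softmax followed by log, with no max-shift (overflows for large logits)."""
    with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
        exp_logits = np.exp(logits)
        probs = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
        return -np.mean(np.log(probs[np.arange(len(y_idx)), y_idx]))

def stable_ce(y_idx, logits):
    """Log-softmax computed with the max-shift (log-sum-exp) trick."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
    return -np.mean(log_softmax[np.arange(len(y_idx)), y_idx])

logits = np.array([[1000.0, 0.0, -1000.0],
                   [0.0, 1000.0, 0.0]])
y_idx = np.array([0, 1])   # the correct class has the huge logit in both rows

naive_loss = naive_ce(y_idx, logits)
stable_loss = stable_ce(y_idx, logits)
print(f"Naive:  {naive_loss}")
print(f"Stable: {stable_loss}")
```

The naive path evaluates `exp(1000)` to infinity and ends in `nan`, while the shifted version returns a loss of zero for these confidently correct predictions.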

Visualizing Loss Landscapes

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # Only needed to register the '3d' projection on older Matplotlib versions

def plot_loss_landscape():
    """Visualize loss as a function of two parameters."""
    # Simple quadratic loss landscape
    w1 = np.linspace(-3, 3, 100)
    w2 = np.linspace(-3, 3, 100)
    W1, W2 = np.meshgrid(w1, w2)
    
    # Loss function: (w1 - 1)^2 + 2*(w2 - 0.5)^2
    L = (W1 - 1)**2 + 2*(W2 - 0.5)**2
    
    fig = plt.figure(figsize=(14, 5))
    
    # 3D surface
    ax1 = fig.add_subplot(121, projection='3d')
    ax1.plot_surface(W1, W2, L, cmap='viridis', alpha=0.8)
    ax1.set_xlabel('w₁')
    ax1.set_ylabel('w₂')
    ax1.set_zlabel('Loss')
    ax1.set_title('Loss Landscape (3D)')
    
    # Contour plot
    ax2 = fig.add_subplot(122)
    contour = ax2.contour(W1, W2, L, levels=20, cmap='viridis')
    ax2.clabel(contour, inline=True, fontsize=8)
    ax2.plot(1, 0.5, 'r*', markersize=15, label='Minimum')
    ax2.set_xlabel('w₁')
    ax2.set_ylabel('w₂')
    ax2.set_title('Loss Landscape (Contour)')
    ax2.legend()
    
    plt.tight_layout()
    plt.show()

plot_loss_landscape()

Advanced Loss Functions

Focal Loss

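Addresses class imbalance by down-weighting easy examples. With $p_t$ the predicted probability of the true class, $\gamma \geq 0$ the focusing parameter, and $\alpha_t$ a class weight: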

\mathcal{L}_{\text{focal}} = -\alpha_t (1 - p_t)^\gamma \log(p_t)

def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25):
    """
    Focal loss for handling class imbalance.
    
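    Args:
        y_true: True labels (0 or 1)
        y_pred: Predicted probabilities (0 to 1)
        gamma: Focusing parameter (higher = more focus on hard examples)
        alpha: Weight for the positive class (negatives get 1 - alpha)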
    """
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
    
    # Compute pt (probability of true class)
    pt = np.where(y_true == 1, y_pred, 1 - y_pred)
    
    # Focal weight
    focal_weight = (1 - pt) ** gamma
    
    # Alpha weight
    alpha_weight = np.where(y_true == 1, alpha, 1 - alpha)
    
    # Cross-entropy
    ce = -np.log(pt)
    
    return np.mean(alpha_weight * focal_weight * ce)

# Compare BCE vs Focal Loss on imbalanced data
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])  # 10% positive class
y_pred = np.array([0.1, 0.2, 0.1, 0.15, 0.05, 0.1, 0.2, 0.1, 0.15, 0.7])

print(f"BCE Loss: {binary_cross_entropy(y_true, y_pred):.4f}")
print(f"Focal Loss: {focal_loss(y_true, y_pred):.4f}")
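The down-weighting is easiest to see per example. This sketch tabulates the modulating factor $(1 - p_t)^\gamma$ for a few made-up values of $p_t$. The alpha weighting is left out here so that only the focusing effect is visible.

```python
import numpy as np

gamma = 2.0
# Hypothetical probabilities assigned to the true class, from easy to hard.
pt = np.array([0.95, 0.7, 0.3, 0.05])

ce = -np.log(pt)                  # plain cross-entropy per example
modulating = (1 - pt) ** gamma    # focal weight per example
focal = modulating * ce           # alpha omitted for clarity

for p, c, m, f in zip(pt, ce, modulating, focal):
    print(f"pt={p:.2f}  CE={c:.4f}  weight={m:.4f}  focal={f:.4f}")
```

An easy example with `pt=0.95` keeps only 0.25% of its cross-entropy, while a hard one with `pt=0.05` keeps about 90%, so hard examples dominate the averaged loss.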

Label Smoothing

Prevents overconfidence by softening targets, with $\epsilon$ the smoothing strength and $K$ the number of classes:

y'_k = \begin{cases} 1 - \epsilon + \frac{\epsilon}{K} & \text{if } k = \text{true class} \\ \frac{\epsilon}{K} & \text{otherwise} \end{cases}

def label_smoothing(y_true, n_classes, epsilon=0.1):
    """
    Apply label smoothing to one-hot labels.
    """
    return y_true * (1 - epsilon) + epsilon / n_classes

# Example
y_hard = np.array([1, 0, 0, 0])  # Hard label
y_smooth = label_smoothing(y_hard, n_classes=4, epsilon=0.1)
print(f"Hard label: {y_hard}")
print(f"Smooth label: {y_smooth}")
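One way to see how smoothing discourages overconfidence is to compare the cross-entropy of two candidate predictions against the smoothed target. The sketch below is illustrative: the `cross_entropy` helper and the two prediction vectors are assumptions chosen for the comparison.

```python
import numpy as np

def label_smoothing(y_true, n_classes, epsilon=0.1):
    return y_true * (1 - epsilon) + epsilon / n_classes

def cross_entropy(target, pred):
    """Cross-entropy of one prediction against one (possibly soft) target."""
    return -np.sum(target * np.log(pred))

y_hard = np.array([1.0, 0.0, 0.0, 0.0])
y_smooth = label_smoothing(y_hard, n_classes=4, epsilon=0.1)

overconfident = np.array([0.997, 0.001, 0.001, 0.001])
matched = y_smooth.copy()          # prediction equal to the smoothed target

ce_overconfident = cross_entropy(y_smooth, overconfident)
ce_matched = cross_entropy(y_smooth, matched)

print(f"Smoothed target sums to {y_smooth.sum():.4f}")
print(f"CE vs smooth target, overconfident: {ce_overconfident:.4f}")
print(f"CE vs smooth target, matched:       {ce_matched:.4f}")
```

Against the smoothed target the near-one-hot prediction scores worse than the softer one, whereas against the hard label it would score better. That reversal is the mechanism that keeps the model from pushing probabilities all the way to 1.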

Contrastive Loss

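For learning embeddings where similar items are close and dissimilar items are at least a margin $m$ apart. Here $d$ is the distance between the two embeddings, and $y = 0$ for a similar pair, $y = 1$ for a dissimilar pair: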

\mathcal{L}_{\text{contrastive}} = (1-y) \cdot \frac{1}{2}d^2 + y \cdot \frac{1}{2}\max(0, m - d)^2

def contrastive_loss(embeddings1, embeddings2, labels, margin=1.0):
    """
    Contrastive loss for similarity learning.
    
    Args:
        embeddings1, embeddings2: Pair of embeddings
        labels: 0 if similar, 1 if dissimilar
        margin: Margin for dissimilar pairs
    """
    distances = np.linalg.norm(embeddings1 - embeddings2, axis=1)
    
    # Similar pairs: minimize distance
    similar_loss = (1 - labels) * 0.5 * distances**2
    
    # Dissimilar pairs: push apart beyond margin
    dissimilar_loss = labels * 0.5 * np.maximum(0, margin - distances)**2
    
    return np.mean(similar_loss + dissimilar_loss)
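A small worked example shows which pairs actually contribute to the loss. The per-pair helper and the 2-D embeddings below are made up for illustration and follow the same label convention (0 = similar, 1 = dissimilar).

```python
import numpy as np

def contrastive_per_pair(emb1, emb2, labels, margin=1.0):
    """Contrastive loss for each pair, without averaging."""
    d = np.linalg.norm(emb1 - emb2, axis=1)
    similar = (1 - labels) * 0.5 * d**2
    dissimilar = labels * 0.5 * np.maximum(0, margin - d)**2
    return similar + dissimilar

emb1 = np.zeros((3, 2))
emb2 = np.array([[0.5, 0.0],    # similar pair, distance 0.5
                 [0.5, 0.0],    # dissimilar pair, distance 0.5 (inside margin)
                 [2.0, 0.0]])   # dissimilar pair, distance 2.0 (outside margin)
labels = np.array([0, 1, 1])

per_pair = contrastive_per_pair(emb1, emb2, labels, margin=1.0)
print(per_pair)           # [0.125 0.125 0.   ]
print(per_pair.mean())
```

The dissimilar pair that is already farther apart than the margin contributes nothing, so training effort goes to pairs that still violate the constraint.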

Triplet Loss

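For learning embeddings with anchor-positive-negative triplets, where the anchor $a$ should be closer to the positive $p$ than to the negative $n$ by at least the margin $m$: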

\mathcal{L}_{\text{triplet}} = \max(0, d(a, p) - d(a, n) + m)

def triplet_loss(anchor, positive, negative, margin=1.0):
    """
    Triplet loss for embedding learning.
    
    Args:
        anchor: Anchor embeddings
        positive: Positive (similar) embeddings
        negative: Negative (dissimilar) embeddings
        margin: Margin between positive and negative
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    
    return np.mean(np.maximum(0, d_pos - d_neg + margin))
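The sketch below evaluates two hand-built triplets to show when the hinge is active. The per-sample helper and the embeddings are illustrative assumptions.

```python
import numpy as np

def triplet_per_sample(anchor, positive, negative, margin=1.0):
    """Triplet loss for each triplet, without averaging."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(0, d_pos - d_neg + margin)

anchor = np.zeros((2, 2))
positive = np.array([[1.0, 0.0], [1.0, 0.0]])   # both at distance 1.0
negative = np.array([[3.0, 0.0],                # distance 3.0: easy triplet
                     [1.5, 0.0]])               # distance 1.5: margin violated

per_triplet = triplet_per_sample(anchor, positive, negative, margin=1.0)
print(per_triplet)   # [0.  0.5]
```

The first triplet already satisfies the margin and yields zero loss (and zero gradient). This is why practical pipelines often search for harder negatives.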

Loss Functions in PyTorch

import torch
import torch.nn as nn

# Regression losses
mse = nn.MSELoss()
mae = nn.L1Loss()
huber = nn.SmoothL1Loss()   # Equals Huber loss with delta=1; use nn.HuberLoss(delta=...) for other deltas

# Classification losses
bce = nn.BCELoss()          # Expects probabilities (apply sigmoid first)
bce_logits = nn.BCEWithLogitsLoss()  # Raw logits (more stable)
ce = nn.CrossEntropyLoss()  # Raw logits (applies log-softmax internally)
nll = nn.NLLLoss()          # Expects log-probabilities (apply log-softmax first)

# Example usage
logits = torch.randn(10, 5)  # 10 samples, 5 classes
targets = torch.randint(0, 5, (10,))  # Class indices

loss = nn.CrossEntropyLoss()(logits, targets)
print(f"Cross-Entropy Loss: {loss.item():.4f}")

# With label smoothing (PyTorch 1.10+)
ce_smooth = nn.CrossEntropyLoss(label_smoothing=0.1)
loss_smooth = ce_smooth(logits, targets)
print(f"With Label Smoothing: {loss_smooth.item():.4f}")
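The comments on the classification losses imply two equivalences, which the following sketch checks on random inputs. The seed, tensor shapes, and tolerance are arbitrary assumptions. The example assumes a PyTorch installation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Multi-class: CrossEntropyLoss on logits == NLLLoss on log-softmax outputs.
logits = torch.randn(10, 5)
targets = torch.randint(0, 5, (10,))
ce_loss = nn.CrossEntropyLoss()(logits, targets)
nll_loss = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)

# Binary: BCEWithLogitsLoss on logits == BCELoss on sigmoid outputs.
bin_logits = torch.randn(10)
bin_targets = torch.randint(0, 2, (10,)).float()
bce_logits_loss = nn.BCEWithLogitsLoss()(bin_logits, bin_targets)
bce_loss = nn.BCELoss()(torch.sigmoid(bin_logits), bin_targets)

print(f"CE  vs NLL(log_softmax): {ce_loss.item():.6f} vs {nll_loss.item():.6f}")
print(f"BCEWithLogits vs BCE(sigmoid): {bce_logits_loss.item():.6f} vs {bce_loss.item():.6f}")
```

The pairs agree on well-behaved inputs like these. The logits-based versions are preferred because they remain accurate when logits become large in magnitude.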

Custom Loss Functions

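class FocalLoss(nn.Module):
    """Focal Loss for class imbalance.

    inputs are raw logits of shape (N, C), targets are class indices of
    shape (N,), and alpha is an optional tensor of C per-class weights.
    """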
    
    def __init__(self, gamma=2.0, alpha=None, reduction='mean'):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha
        self.reduction = reduction
    
    def forward(self, inputs, targets):
        ce_loss = nn.functional.cross_entropy(
            inputs, targets, reduction='none'
        )
        pt = torch.exp(-ce_loss)
        focal_loss = (1 - pt) ** self.gamma * ce_loss
        
        if self.alpha is not None:
            alpha_t = self.alpha[targets]
            focal_loss = alpha_t * focal_loss
        
        if self.reduction == 'mean':
            return focal_loss.mean()
        elif self.reduction == 'sum':
            return focal_loss.sum()
        return focal_loss

# Usage
criterion = FocalLoss(gamma=2.0)
loss = criterion(logits, targets)

Choosing the Right Loss Function

Decision Guide

Task	Loss Function	Output Activation
Regression	MSE	Linear
Regression with outliers	Huber / MAE	Linear
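Binary classification	BCE	Sigmoid (none with BCEWithLogitsLoss)
Multi-class classification	Cross-Entropy	Softmax (none with nn.CrossEntropyLoss, which takes raw logits)
Multi-label classification	BCE	Sigmoid (one per label)
Imbalanced classification	Focal Loss	Sigmoid (binary) or softmax (multi-class)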
Similarity learning	Contrastive / Triplet	Normalized embeddings
Object detection	Focal + Smooth L1	Various
Image generation	Perceptual + L1	Tanh

Common Mistakes

Mistake	Problem	Solution
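Using MSE for classification	Gradients vanish when a sigmoid/softmax output saturates, so confidently wrong predictions learn slowly	Use cross-entropy
Forgetting epsilon in log	NaN/Inf values from log(0)	Clip predictions, or compute the loss from logits
Wrong activation + loss combo	Numerical instability, or softmax/sigmoid applied twice	Use the logits-based losses (BCEWithLogitsLoss, CrossEntropyLoss) on raw outputs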
Ignoring class imbalance	Poor minority class performance	Focal loss, weighted loss
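The first mistake in the table can be quantified. For a sigmoid output, the sketch below compares the gradient with respect to the logit under MSE and under BCE for a confidently wrong prediction. The helper names and the logit value of -10 are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_mse_wrt_logit(z, y):
    """d/dz of (sigmoid(z) - y)^2, via the chain rule."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def grad_bce_wrt_logit(z, y):
    """d/dz of binary cross-entropy on sigmoid(z); simplifies to p - y."""
    return sigmoid(z) - y

# Confidently wrong: true label is 1, but the logit is strongly negative.
z, y = -10.0, 1.0
g_mse = grad_mse_wrt_logit(z, y)
g_bce = grad_bce_wrt_logit(z, y)

print(f"MSE gradient: {g_mse:.2e}")
print(f"BCE gradient: {g_bce:.4f}")
```

The MSE gradient carries the extra factor `p * (1 - p)`, which collapses toward zero when the sigmoid saturates. The BCE gradient stays close to -1, so the model can still correct the mistake.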

Exercises

Exercise 1: Implement and Compare

Implement these losses from scratch and compare their gradients:

Hinge loss: $\max(0, 1 - y \cdot \hat{y})$
Exponential loss: $e^{-y \cdot \hat{y}}$
Logistic loss: $\log(1 + e^{-y \cdot \hat{y}})$

Exercise 2: Custom Multi-Task Loss

Create a loss function for a model that simultaneously:

Classifies images (cross-entropy)
Predicts bounding boxes (smooth L1)
Estimates uncertainty (KL divergence)

How do you weight the different losses?

Exercise 3: Loss Landscape Visualization

Train a small network on a 2D classification task. Visualize the loss landscape:

Along the line connecting initial and final weights
In a random 2D plane around the minimum

What do you observe about the landscape shape?

Exercise 4: Focal Loss Tuning

On an imbalanced dataset (1:10 ratio):

Train with BCE, Weighted BCE, and Focal Loss
Tune the gamma parameter in Focal Loss
Plot precision-recall curves for each

Which loss gives the best F1 score on the minority class?

Key Takeaways

Concept	Key Insight
Loss = Objective	What we optimize is what we get
MSE vs MAE	MSE penalizes large errors quadratically; MAE is more robust to outliers
Cross-Entropy	Information-theoretic, great for classification
Focal Loss	Down-weights easy examples to handle class imbalance
Contrastive/Triplet	For learning embeddings
Label Smoothing	Prevents overconfidence

What’s Next

We’ve covered the foundations! Now let’s move to powerful architectures:

Module 6: Convolutional Neural Networks

The architecture that revolutionized computer vision: convolutions, filters, and feature maps.
