Regularization Techniques

Regularization for Deep Networks

The Overfitting Problem

Here is a paradox: deep networks are powerful because they have millions of parameters, but those same parameters are also a liability. Given enough capacity, a network will memorize the training data (including its noise and labeling mistakes) rather than learn generalizable patterns. A ResNet-50 with roughly 25 million parameters can easily memorize the entire CIFAR-10 training set of 50,000 images; what you want is for it to learn what distinguishes a “cat” from a “dog,” not pixel-perfect recall of every training image.

An analogy: imagine a student who memorizes every answer in the textbook word for word. They ace the practice problems but fail the exam because the questions are worded differently. Regularization is like telling the student “you cannot take notes into the exam”: it forces them to understand the underlying concepts rather than memorize surface patterns.

Regularization constrains the model, making memorization harder and pushing the network toward simpler, more generalizable solutions. There is no single “best” regularizer. You typically combine several techniques, each attacking overfitting from a different angle.

Weight Decay (L2 Regularization)

The simplest form of regularization: add a penalty on weight magnitude to the loss function. Think of it as a tax on complexity: the bigger the weights, the higher the tax. This pushes the network toward solutions with smaller weights, which tend to be smoother and more generalizable.

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \frac{\lambda}{2} \sum_i w_i^2$$

Effect: Pushes weights toward zero, preventing extreme values. Geometrically, it constrains the weight vector to a ball centered at the origin. The penalty is quadratic, so large weights are penalized much more than small ones.
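To see why this penalty is called weight decay, it may help to write out one plain gradient-descent step on the total loss. The learning rate η is a symbol introduced here for illustration; it does not appear in the formula above.

```latex
\begin{aligned}
\nabla_{w_i} \mathcal{L}_{\text{total}}
  &= \nabla_{w_i} \mathcal{L}_{\text{task}} + \lambda w_i \\
w_i
  &\leftarrow w_i - \eta \left( \nabla_{w_i} \mathcal{L}_{\text{task}} + \lambda w_i \right) \\
  &= (1 - \eta\lambda)\, w_i - \eta\, \nabla_{w_i} \mathcal{L}_{\text{task}}
\end{aligned}
```

Under plain SGD, each step first shrinks every weight by the factor (1 - ηλ) and then applies the usual task gradient. This equivalence between the L2 penalty and multiplicative decay holds for plain SGD; with adaptive optimizers such as Adam the two are no longer the same.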
import torch.optim as optim

# Apply weight decay in optimizer -- the modern, correct way
optimizer = optim.AdamW(
    model.parameters(),
    lr=1e-4,
    weight_decay=0.01  # Typical range: 0.01 to 0.1
)
AdamW vs Adam + L2: These are NOT the same thing. Classic Adam adds the L2 gradient term before the adaptive scaling, so the effective regularization depends on the gradient history: parameters with large gradients get less regularization. AdamW decouples weight decay from the adaptive update, so every parameter is shrunk at the same rate regardless of gradient magnitude. This matters in practice: AdamW tends to give better results for Transformers and most modern architectures, which makes it a sensible default over Adam with L2.
Pitfall: applying weight decay to bias and normalization parameters. Weight decay should typically NOT be applied to bias terms or normalization layer parameters (gamma/beta). These make up a tiny fraction of the parameters and contribute little to overfitting, so shrinking them toward zero tends to hurt more than it helps. Most frameworks, PyTorch included, apply weight decay to every parameter handed to the optimizer, so you should explicitly exclude biases and norm layers. The resulting difference is often around 0.5-1% accuracy, and many practitioners miss it.
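One way to exclude biases and normalization parameters is to pass the optimizer two parameter groups. The sketch below assumes a simple heuristic (treat every 1-D parameter as a bias or norm parameter); the helper name `split_decay_params` and the toy model are hypothetical and only serve the illustration.

```python
import torch.nn as nn
import torch.optim as optim


def split_decay_params(model: nn.Module):
    """Split trainable parameters into (decay, no_decay) lists.

    Assumed heuristic: 1-D parameters (biases, norm gamma/beta) are
    excluded from weight decay; all other parameters are decayed.
    """
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if param.ndim <= 1 or name.endswith(".bias"):
            no_decay.append(param)
        else:
            decay.append(param)
    return decay, no_decay


# Toy model, used only for illustration.
model = nn.Sequential(nn.Linear(16, 32), nn.LayerNorm(32), nn.Linear(32, 4))
decay, no_decay = split_decay_params(model)

optimizer = optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},   # weight matrices
        {"params": no_decay, "weight_decay": 0.0},  # biases and norm params
    ],
    lr=1e-4,
)
```

The dimensionality check is a convenience, not a rule: models with unusual 1-D parameters may need an explicit list of module types instead.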

Dropout

Randomly zero activations during training:
import torch
import torch.nn as nn

class DropoutFromScratch(nn.Module):
    """Dropout implementation from scratch."""
    
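    def __init__(self, p=0.5):
        super().__init__()
        # p is the drop probability; p = 1 would divide by zero in forward()
        if not 0.0 <= p < 1.0:
            raise ValueError(f"p must be in [0, 1), got {p}")
        self.p = p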
    
    def forward(self, x):
        if self.training:
            mask = (torch.rand_like(x) > self.p).float()
            return x * mask / (1 - self.p)  # Scale to maintain expectation
        return x
Why it works: By randomly silencing neurons during each forward pass, dropout forces the network to learn redundant representations, so no single neuron can become a critical bottleneck. Each training step effectively trains a different sub-network, and at inference time (when all neurons are active) the full network approximates an ensemble of all these sub-networks. The / (1 - self.p) scaling factor (called “inverted dropout”) keeps the expected activation the same whether dropout is active or not, so nothing needs to be rescaled at inference time.

| Layer Type | Typical Dropout Rate |
| --- | --- |
| Fully connected | 0.3 - 0.5 |
| After attention | 0.1 - 0.3 |
| Embedding | 0.0 - 0.1 |
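
The inverted-dropout scaling described above can be checked empirically with PyTorch's built-in `nn.Dropout`. This is a small sanity check rather than a proof; the seed and tensor size are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

drop = nn.Dropout(p=0.5)
x = torch.ones(10_000)

drop.train()
y_train = drop(x)  # each element is either 0 or 1 / (1 - 0.5) = 2

drop.eval()
y_eval = drop(x)   # identity: dropout is disabled in eval mode

kept_fraction = (y_train != 0).float().mean().item()
train_mean = y_train.mean().item()
print(f"kept fraction ~ {kept_fraction:.3f}, train-mode mean ~ {train_mean:.3f}")
```

In training mode roughly half the elements survive and each survivor is doubled, so the mean stays close to the input mean of 1. In eval mode the input passes through unchanged.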

Data Augmentation

Often the single most effective regularizer: artificially expand the training set by applying random, label-preserving transformations to each sample.
from torchvision import transforms

# Standard augmentation pipeline
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

Advanced Augmentations

These techniques go beyond geometric transforms: they force the network to cope with partially missing or blended inputs, which tends to improve robustness.
# CutOut: Randomly mask a rectangular region with zeros.
# Forces the network to not rely on any single spatial region.
# Think of it like covering part of a photo with your thumb --
# you can still recognize the object from the visible parts.
class Cutout:
    def __init__(self, size=16):
        self.size = size  # Side length of the square mask
    
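    def __call__(self, img):
        # Expects a (C, H, W) tensor, so apply this after ToTensor().
        # Note: the mask is written into img in place.
        h, w = img.shape[1:]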
        # Pick a random center point for the mask
        y = torch.randint(h, (1,)).item()
        x = torch.randint(w, (1,)).item()
        
        # Compute mask boundaries, clamped to image edges
        y1 = max(0, y - self.size // 2)
        y2 = min(h, y + self.size // 2)
        x1 = max(0, x - self.size // 2)
        x2 = min(w, x + self.size // 2)
        
        img[:, y1:y2, x1:x2] = 0  # Zero out the region
        return img

# MixUp: Blend two training samples and their labels.
# Instead of training on pure examples, the network sees
# weighted combinations: "this is 70% cat and 30% dog."
# This smooths decision boundaries and reduces overconfidence.
def mixup(x, y, alpha=0.2):
    # Sample mixing coefficient from Beta distribution.
    # alpha=0.2 gives most values near 0 or 1 (slight mixing).
    # alpha=1.0 gives uniform mixing (more aggressive).
    lam = torch.distributions.Beta(alpha, alpha).sample()
    batch_size = x.size(0)
    index = torch.randperm(batch_size, device=x.device)  # Random pairing within batch
    
    mixed_x = lam * x + (1 - lam) * x[index]
    y_a, y_b = y, y[index]
    
    return mixed_x, y_a, y_b, lam

# Training with MixUp -- note the loss is also a weighted combination
x, y_a, y_b, lam = mixup(x, y)
logits = model(x)  # One forward pass, reused for both loss terms
loss = lam * criterion(logits, y_a) + (1 - lam) * criterion(logits, y_b)
Practical tip: MixUp and CutMix (which pastes a rectangular patch from one image onto another and mixes the labels accordingly) are among the highest-impact regularizers for image classification, often adding 1-2% accuracy on top of standard augmentations. Their compute overhead is negligible, so they are worth including in your default training pipeline for vision tasks. CutMix + MixUp + Label Smoothing is a widely used combination in modern training recipes.

Label Smoothing

Hard labels say “this is 100% cat, 0% everything else.” But real-world data is ambiguous: that blurry image might be 95% cat and 5% could-be-a-small-dog. Label smoothing softens the targets to reflect this uncertainty, preventing the model from becoming overconfident in its predictions. An overconfident model produces sharp, peaky probability distributions that tend to be poorly calibrated; label smoothing mitigates this.

$$y_{\text{smooth}} = (1 - \alpha) \cdot y_{\text{hard}} + \frac{\alpha}{K}$$

Here α is the smoothing strength (the smoothing argument in the code below) and K is the number of classes.
class LabelSmoothingCrossEntropy(nn.Module):
    def __init__(self, smoothing=0.1):
        super().__init__()
        self.smoothing = smoothing
    
    def forward(self, pred, target):
        n_classes = pred.size(-1)
        log_probs = torch.log_softmax(pred, dim=-1)
        
        # Smooth labels
        targets = torch.zeros_like(log_probs).scatter_(
            1, target.unsqueeze(1), 1
        )
        targets = (1 - self.smoothing) * targets + self.smoothing / n_classes
        
        loss = (-targets * log_probs).sum(dim=-1).mean()
        return loss
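
Recent PyTorch versions also expose a `label_smoothing` argument on the built-in cross-entropy, which uses the same (1 - α) · y_hard + α/K targets. The sketch below compares a manual computation against the built-in on made-up logits and labels; the batch size, class count, and seed are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed

smoothing = 0.1
logits = torch.randn(4, 5)           # made-up batch: 4 samples, 5 classes
target = torch.tensor([0, 2, 1, 4])  # made-up class indices

# Manual version: build smoothed targets, then take the cross-entropy.
n_classes = logits.size(-1)
one_hot = F.one_hot(target, n_classes).float()
smooth_targets = (1 - smoothing) * one_hot + smoothing / n_classes
manual_loss = (-smooth_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# Built-in version.
builtin_loss = F.cross_entropy(logits, target, label_smoothing=smoothing)
```

With α = 0.1 and K = 5, the true class gets a target of 0.9 + 0.02 = 0.92 and every other class gets 0.02, so each row of targets still sums to 1.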

Early Stopping

Monitor validation loss and stop training once it has failed to improve for patience consecutive epochs, keeping the checkpoint from the best epoch:
class EarlyStopping:
    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = float('inf')
        self.should_stop = False
    
    def __call__(self, val_loss, model):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            torch.save(model.state_dict(), 'best_model.pt')
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.should_stop = True
        return self.should_stop

# Usage
early_stopping = EarlyStopping(patience=10)
for epoch in range(max_epochs):
    train(...)
    val_loss = validate(...)
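    if early_stopping(val_loss, model):
        print("Early stopping triggered!")
        break

# Restore the best weights saved by EarlyStopping before evaluating
model.load_state_dict(torch.load('best_model.pt'))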
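
To make the patience counter concrete, the following sketch replays the same rule over a made-up list of validation losses, without any model or checkpointing. The function name `stop_epoch` and the loss values are hypothetical, chosen only to illustrate the logic.

```python
def stop_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (epoch at which training stops, epoch of the best loss).

    Mirrors the counter logic above: the counter resets on improvement,
    and training stops once it reaches `patience`.
    """
    best_loss = float("inf")
    best_epoch = None
    counter = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, counter = loss, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return epoch, best_epoch
    return None, best_epoch  # patience was never exhausted


# Made-up validation curve: improves, then drifts upward.
losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70]
stopped_at, best_at = stop_epoch(losses, patience=3)
```

Here training stops at epoch 5 while the best loss occurred at epoch 2, which is why the checkpoint from the best epoch, rather than the final weights, is usually the one to keep.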

Comparison of Regularization Techniques

| Technique | Effect | When to Use |
| --- | --- | --- |
| Weight Decay | Penalize large weights | Always (0.01-0.1) |
| Dropout | Random deactivation | Dense layers, attention |
| Data Augmentation | Expand training data | Always for vision |
| Label Smoothing | Soften targets | Classification |
| Early Stopping | Prevent overtraining | Always |
| Stochastic Depth | Drop whole layers | Very deep networks |
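
Stochastic depth is the only technique in the table without an example in this section. As a rough sketch, assuming a residual block of the form x + f(x), it randomly skips the residual branch f for each sample during training and rescales the surviving branches, much like inverted dropout applied to whole layers. The class name `StochasticDepthBlock` and the `drop_prob` default are hypothetical.

```python
import torch
import torch.nn as nn


class StochasticDepthBlock(nn.Module):
    """Residual block whose branch is randomly skipped per sample."""

    def __init__(self, branch: nn.Module, drop_prob: float = 0.2):
        super().__init__()
        if not 0.0 <= drop_prob < 1.0:
            raise ValueError("drop_prob must be in [0, 1)")
        self.branch = branch
        self.drop_prob = drop_prob

    def forward(self, x):
        out = self.branch(x)
        if self.training and self.drop_prob > 0.0:
            keep_prob = 1.0 - self.drop_prob
            # One Bernoulli draw per sample, broadcast over remaining dims.
            shape = (x.size(0),) + (1,) * (x.dim() - 1)
            mask = (torch.rand(shape, device=x.device) < keep_prob).to(x.dtype)
            out = out * mask / keep_prob  # rescale to keep the expectation
        return x + out


block = StochasticDepthBlock(nn.Linear(8, 8), drop_prob=0.5)
x = torch.randn(4, 8)

block.eval()
y_eval = block(x)  # eval mode: branch always applied, no rescaling
```

In practice the drop probability is often increased with depth, so early layers are almost always kept; that schedule is left out of this sketch.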

Exercises

Train a network with dropout rates 0, 0.1, 0.3, 0.5, 0.7. Plot train vs val accuracy for each.
Compare model performance with: no augmentation, basic flips, full augmentation pipeline.
Implement CutMix (rectangular patches from different images) and compare with MixUp.

What’s Next

Module 18: Optimization Algorithms

SGD, Adam, AdamW, and modern optimizers for deep learning.