Normalization Techniques

Why Normalize?

Imagine a factory assembly line where each station expects parts within a certain size range. If Station 3 suddenly starts outputting parts twice as big, Station 4’s machinery jams. Someone has to stop the line, recalibrate, and restart. Now imagine this recalibration happening after every batch of parts.

That is what deep networks face without normalization. As the earlier layers update their weights, the distribution of inputs to each later layer shifts, so every layer has to keep re-adapting instead of learning useful features. This effect is called internal covariate shift, and it causes:
  • Slower training (need smaller learning rates to avoid instability)
  • Difficulty with saturating activations (inputs drift into flat regions of sigmoid/tanh)
  • Careful initialization requirements (poor init = immediate failure)
Normalization stabilizes these distributions by re-centering and re-scaling activations at each layer, enabling:
  • Higher learning rates (often 10x larger)
  • Faster convergence (typically 2-3x fewer epochs)
  • Reduced sensitivity to initialization (networks “just work” with standard init)
The honest truth about “internal covariate shift”: The original BatchNorm paper (Ioffe and Szegedy, 2015) attributed its success to reducing internal covariate shift. Later research (Santurkar et al., 2018) showed this explanation is incomplete — BatchNorm primarily works by smoothing the loss landscape, making it easier for optimizers to navigate. The mechanism matters less than the result: normalization is one of the most impactful techniques in modern deep learning.

Batch Normalization

The most influential normalization technique, introduced in 2015. The core idea: for each feature channel, compute the mean and variance across all examples in the current mini-batch, then normalize so the channel has zero mean and unit variance. Finally, apply a learnable scale ($\gamma$) and shift ($\beta$) so the network can undo the normalization if that is optimal.

Why the learnable parameters? Without $\gamma$ and $\beta$, normalization would force every layer’s output to have zero mean and unit variance, which might be too restrictive. The learnable parameters let the network decide the “ideal” mean and variance for each channel. If the network learns $\gamma = \sqrt{\sigma_B^2 + \epsilon}$ and $\beta = \mu_B$, it has effectively undone the normalization, so in theory BatchNorm does not limit what the layer can represent.

Normalize across the batch dimension:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \cdot \gamma + \beta$$

Where $\mu_B$ and $\sigma_B^2$ are the mini-batch mean and variance, $\epsilon$ is a small constant for numerical stability, and $\gamma$, $\beta$ are learnable.
import torch
import torch.nn as nn

class BatchNorm(nn.Module):
    """Batch Normalization from scratch for 4D input of shape (N, C, H, W)."""
    
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.eps = eps
        self.momentum = momentum
        
        # Learnable parameters
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))
        
        # Running statistics (not learnable)
        self.register_buffer('running_mean', torch.zeros(num_features))
        self.register_buffer('running_var', torch.ones(num_features))
    
    def forward(self, x):
        if self.training:
            # Compute batch statistics
            mean = x.mean(dim=(0, 2, 3))  # Mean over batch, H, W
            var = x.var(dim=(0, 2, 3), unbiased=False)
            
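            # Update running statistics in place, outside the autograd graph
            # (otherwise the buffers would keep references to every batch's graph)
            with torch.no_grad():
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * mean)
                self.running_var.mul_(1 - self.momentum).add_(self.momentum * var)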
        else:
            mean = self.running_mean
            var = self.running_var
        
        # Normalize
        x = (x - mean.view(1, -1, 1, 1)) / torch.sqrt(var.view(1, -1, 1, 1) + self.eps)
        
        # Scale and shift
        return self.gamma.view(1, -1, 1, 1) * x + self.beta.view(1, -1, 1, 1)
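As a quick sanity check on the formula, the sketch below normalizes a random batch by hand and compares the result with PyTorch's built-in `nn.BatchNorm2d` in training mode. It also verifies the "undo" property: choosing $\gamma = \sqrt{\sigma_B^2 + \epsilon}$ and $\beta = \mu_B$ recovers the original input. The tensor sizes and the scale/offset applied to the input are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
eps = 1e-5

# Input of shape (N, C, H, W), deliberately far from zero mean / unit variance
x = torch.randn(8, 3, 4, 4) * 5.0 + 2.0

# Per-channel batch statistics (biased variance, as used for normalization)
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_hat = (x - mean) / torch.sqrt(var + eps)

# Built-in layer in training mode; gamma=1 and beta=0 at initialization
bn = nn.BatchNorm2d(3, eps=eps)
bn.train()
with torch.no_grad():
    y_ref = bn(x)
matches_builtin = torch.allclose(x_hat, y_ref, atol=1e-5)

# With gamma = sqrt(var + eps) and beta = mean, the normalization is undone
gamma = torch.sqrt(var + eps)
beta = mean
x_recovered = gamma * x_hat + beta
undone = torch.allclose(x_recovered, x, atol=1e-4)

print(matches_builtin, undone)
```
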
BatchNorm behaves differently at train vs eval! During training, it uses the current batch’s statistics. During inference, it uses the running averages accumulated during training. Forgetting model.eval() before inference is one of the most common bugs in deep learning — your model will produce inconsistent outputs that depend on what other examples happen to be in the inference batch. Always call model.eval() before inference and model.train() before training resumes.
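The train/eval difference can be observed directly. In this minimal sketch, the same sample is placed into two different batches (the batch contents are arbitrary random data). In training mode its output changes with its batch companions; in eval mode it does not, because the running statistics are used instead.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)

# The same sample embedded in two batches with different companions
sample = torch.randn(1, 4)
batch_a = torch.cat([sample, torch.randn(7, 4)])
batch_b = torch.cat([sample, torch.randn(7, 4) * 3.0 + 1.0])

# Training mode: statistics come from the current batch
bn.train()
with torch.no_grad():
    out_a_train = bn(batch_a)[0]
    out_b_train = bn(batch_b)[0]

# Eval mode: statistics come from the running averages
bn.eval()
with torch.no_grad():
    out_a_eval = bn(batch_a)[0]
    out_b_eval = bn(batch_b)[0]

train_depends_on_batch = not torch.allclose(out_a_train, out_b_train)
eval_is_consistent = torch.allclose(out_a_eval, out_b_eval)
print(train_depends_on_batch, eval_is_consistent)
```
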
Pitfall — small batch sizes: BatchNorm estimates mean and variance from the current mini-batch. With a batch size of 2 or 4, these estimates are extremely noisy, which destabilizes training. If your GPU memory limits you to small batches (common with high-resolution images or large models), switch to GroupNorm or LayerNorm instead. A good rule of thumb: BatchNorm works well with batch sizes of 32 or larger; below 16, consider alternatives.
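To get a feel for how noisy small-batch statistics are, the sketch below splits synthetic data (true mean 1, standard deviation 2, both arbitrary choices) into batches and measures how much the per-batch mean fluctuates. The helper `batch_mean_spread` is a hypothetical name used only for this illustration.

```python
import torch

torch.manual_seed(0)
# Synthetic activations for one feature: true mean 1.0, true std 2.0
data = torch.randn(4096) * 2.0 + 1.0

def batch_mean_spread(batch_size):
    """Standard deviation of the per-batch mean estimates."""
    batches = data.view(-1, batch_size)
    return batches.mean(dim=1).std().item()

spread_small = batch_mean_spread(2)    # batch size 2
spread_large = batch_mean_spread(64)   # batch size 64
print(f"batch=2: {spread_small:.3f}  batch=64: {spread_large:.3f}")
```

The spread of the mean estimate shrinks roughly with the square root of the batch size, so a batch of 2 gives a far noisier estimate than a batch of 64.
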

Layer Normalization

The critical difference from BatchNorm: LayerNorm normalizes across the feature dimension within a single sample, completely independent of other examples in the batch. Think of it this way: BatchNorm asks “how does this feature compare across all images in the batch?” while LayerNorm asks “how does this feature compare to all other features within this one image/token?” This independence from batch size is why LayerNorm became the standard for Transformers and RNNs, where batch sizes vary and sequences have different lengths.

$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta$$

Where $\mu$ and $\sigma^2$ are the mean and variance computed per sample across all features.
class LayerNorm(nn.Module):
    """Layer Normalization -- the default choice for Transformers.
    
    Unlike BatchNorm, statistics are computed per-sample across
    all features, so behavior is identical at train and eval time.
    No running statistics to maintain, no batch-size sensitivity.
    """
    
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        self.eps = eps
        # Learnable scale and shift, same role as in BatchNorm
        self.gamma = nn.Parameter(torch.ones(normalized_shape))
        self.beta = nn.Parameter(torch.zeros(normalized_shape))
    
    def forward(self, x):
        # Compute mean and variance across the last dimension (features)
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta
Used in: Transformers, RNNs, and any architecture where batch-size independence matters
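The batch independence can be checked with PyTorch's built-in `nn.LayerNorm`. In this sketch (tensor sizes are arbitrary), a sequence produces the same output whether it is normalized alone or as part of a larger batch, and every token ends up with roughly zero mean and unit standard deviation across its features.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(16)

# (batch, sequence, features)
tokens = torch.randn(4, 10, 16)

with torch.no_grad():
    out_batch = ln(tokens)        # normalize the whole batch
    out_single = ln(tokens[:1])   # normalize the first sequence alone

# Each token is normalized independently of the rest of the batch
same_output = torch.allclose(out_batch[:1], out_single, atol=1e-6)

# Per-token statistics over the feature dimension
per_token_mean = out_batch.mean(dim=-1)
per_token_std = out_batch.std(dim=-1, unbiased=False)
print(same_output, per_token_mean.abs().max().item(), per_token_std.mean().item())
```
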
Pre-norm vs post-norm placement matters a lot. In Transformers, placing LayerNorm before the attention/FFN block (pre-norm) generally trains more stably than the original post-norm placement, where the normalization is applied after the residual addition. Most modern architectures (GPT-2 onward, LLaMA, etc.) use pre-norm. A common explanation is that pre-norm leaves the residual stream as an unnormalized identity path, so gradients pass through deep Transformer stacks more directly and training is less prone to instability.
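The two placements differ only in where the normalization sits relative to the residual addition. The sketch below uses hypothetical class names (`PreNormBlock`, `PostNormBlock`) and a plain `nn.Linear` as a stand-in for the attention or FFN sublayer; it is meant to show the wiring, not a full Transformer block.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm: x + sublayer(norm(x)). The residual path is left untouched."""

    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

class PostNormBlock(nn.Module):
    """Post-norm: norm(x + sublayer(x)). The residual sum itself is normalized."""

    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

torch.manual_seed(0)
dim = 16
x = torch.randn(2, 5, dim)  # (batch, sequence, features)

# nn.Linear stands in for attention/FFN (an assumption for illustration)
pre = PreNormBlock(dim, nn.Linear(dim, dim))
post = PostNormBlock(dim, nn.Linear(dim, dim))

with torch.no_grad():
    pre_out = pre(x)
    post_out = post(x)
    # In pre-norm, subtracting the sublayer branch recovers x exactly as it came in
    identity_path = pre_out - pre.sublayer(pre.norm(x))
```

In the pre-norm block the input reaches the output through a pure addition, while in the post-norm block every output passes through a LayerNorm.
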

Comparison of Normalization Types

| Type | Normalize Over | Best For |
| --- | --- | --- |
| Batch Norm | Batch, H, W | CNNs with large batches |
| Layer Norm | C, H, W (per sample) | Transformers, RNNs |
| Instance Norm | H, W (per channel) | Style transfer |
| Group Norm | Groups of channels | Small batches |
| RMSNorm | Features (no mean) | LLMs (faster) |
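The "Normalize Over" column can be verified with PyTorch's built-in layers. This sketch (with arbitrary tensor sizes and two groups for GroupNorm) applies each layer to the same input and checks that the output mean is approximately zero over exactly the dimensions listed in the table.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, C, H, W = 8, 6, 5, 5
num_groups = 2  # arbitrary choice; must divide C
x = torch.randn(N, C, H, W) * 3.0 + 1.0

batch_norm = nn.BatchNorm2d(C).train()
layer_norm = nn.LayerNorm([C, H, W])
instance_norm = nn.InstanceNorm2d(C)
group_norm = nn.GroupNorm(num_groups=num_groups, num_channels=C)

with torch.no_grad():
    # Batch Norm: one mean per channel, over (N, H, W)
    bn_mean = batch_norm(x).mean(dim=(0, 2, 3))                 # shape (C,)
    # Layer Norm: one mean per sample, over (C, H, W)
    ln_mean = layer_norm(x).mean(dim=(1, 2, 3))                 # shape (N,)
    # Instance Norm: one mean per sample and channel, over (H, W)
    in_mean = instance_norm(x).mean(dim=(2, 3))                 # shape (N, C)
    # Group Norm: one mean per sample and channel group
    gn_out = group_norm(x).view(N, num_groups, C // num_groups, H, W)
    gn_mean = gn_out.mean(dim=(2, 3, 4))                        # shape (N, num_groups)

for name, m in [("batch", bn_mean), ("layer", ln_mean),
                ("instance", in_mean), ("group", gn_mean)]:
    print(name, tuple(m.shape), m.abs().max().item())
```
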

RMSNorm (Modern LLMs)

RMSNorm is a simplified variant of LayerNorm that skips the mean-centering step entirely and only divides by the root-mean-square of the activations. This saves both compute and memory (no mean calculation, no $\beta$ parameter). The empirical finding is surprising: the re-centering step in LayerNorm contributes very little to training stability; most of the benefit comes from the re-scaling alone. This is why LLaMA, Gemma, Mistral, and most modern LLMs use RMSNorm: it can be around 10-15% faster than LayerNorm at scale, typically with no measurable loss in quality.

$$\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} \cdot \gamma, \quad \text{RMS}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}$$

In the implementation below, $\gamma$ is stored as `weight`, and a small $\epsilon$ is added inside the square root for numerical stability.
class RMSNorm(nn.Module):
    """Root Mean Square Normalization (used in LLaMA, Gemma, Mistral, etc.).
    
    Compared to LayerNorm:
    - No mean subtraction (no re-centering)
    - No learnable bias (beta parameter)
    - ~10-15% faster at scale with equivalent quality
    """
    
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        # Only a scale parameter, no bias -- this is the key simplification
        self.weight = nn.Parameter(torch.ones(dim))
    
    def forward(self, x):
        # Root mean square: sqrt(mean(x^2))
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight
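To make the relationship with LayerNorm concrete, the sketch below uses a standalone function `rms_norm` (a hypothetical helper mirroring the forward pass above) and checks two things: the output has approximately unit RMS, and on input that is already zero-mean, RMSNorm matches LayerNorm without a bias. The latter follows because variance equals mean square when the mean is zero.

```python
import torch
import torch.nn.functional as F

def rms_norm(x, weight, eps=1e-6):
    """Functional RMSNorm over the last dimension."""
    rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + eps)
    return x / rms * weight

torch.manual_seed(0)
dim = 32
x = torch.randn(4, dim) + 0.5   # features with a non-zero mean
weight = torch.ones(dim)

# 1) The output has (approximately) unit root-mean-square per row
out = rms_norm(x, weight)
out_rms = out.pow(2).mean(dim=-1).sqrt()

# 2) On zero-mean input, RMSNorm and LayerNorm (no affine) agree
x_centered = x - x.mean(dim=-1, keepdim=True)
rms_out = rms_norm(x_centered, weight, eps=1e-6)
ln_out = F.layer_norm(x_centered, (dim,), eps=1e-6)
agree_when_centered = torch.allclose(rms_out, ln_out, atol=1e-5)

# On the uncentered input the two generally differ
differ_when_uncentered = not torch.allclose(
    out, F.layer_norm(x, (dim,), eps=1e-6), atol=1e-3
)
print(out_rms, agree_when_centered, differ_when_uncentered)
```
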

When to Use What

| Scenario | Recommendation |
| --- | --- |
| CNN with batch ≥ 32 | Batch Norm |
| CNN with small batch | Group Norm |
| Transformer/Attention | Layer Norm or RMSNorm |
| RNN/LSTM | Layer Norm |
| Style Transfer | Instance Norm |
| Modern LLM | RMSNorm |
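One way to apply these recommendations in code is a small factory function. The sketch below is only an illustration: `make_norm`, its scenario strings, and the default of 8 groups for GroupNorm are all assumptions, not part of any library. RMSNorm is left out so that the sketch does not depend on a particular PyTorch version; the from-scratch `RMSNorm` class above can fill that role.

```python
import torch
import torch.nn as nn

def make_norm(scenario, num_channels):
    """Hypothetical helper mapping a scenario name to a PyTorch norm layer."""
    if scenario == "cnn_large_batch":
        return nn.BatchNorm2d(num_channels)
    if scenario == "cnn_small_batch":
        # num_groups must divide num_channels; 8 is an assumed default
        num_groups = 8 if num_channels % 8 == 0 else 1
        return nn.GroupNorm(num_groups, num_channels)
    if scenario == "style_transfer":
        return nn.InstanceNorm2d(num_channels, affine=True)
    if scenario in ("transformer", "rnn"):
        # Normalizes over the last (feature) dimension
        return nn.LayerNorm(num_channels)
    raise ValueError(f"unknown scenario: {scenario!r}")

# Usage: a CNN feature map with a batch size of only 2
x = torch.randn(2, 16, 8, 8)
small_batch_norm = make_norm("cnn_small_batch", 16)
with torch.no_grad():
    y = small_batch_norm(x)
print(type(small_batch_norm).__name__, tuple(y.shape))
```
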

Exercises

Train the same CNN with and without BatchNorm. Compare learning curves, final accuracy, and sensitivity to learning rate.
Compare BatchNorm vs GroupNorm with batch sizes of 2, 4, 8, 16, 32.
In transformers, compare placing LayerNorm before vs after attention/FFN blocks.

What’s Next

Module 17: Regularization for Deep Networks

Dropout, weight decay, and other techniques for preventing overfitting.