Normalization Techniques
Why Normalize?
Imagine a factory assembly line where each station expects parts within a certain size range. If Station 3 suddenly starts outputting parts twice as big, Station 4’s machinery jams. Someone has to stop the line, recalibrate, and restart. Now imagine this recalibration happening after every batch of parts. That is what deep networks face without normalization: each layer’s input distribution shifts as the layers before it update their weights, and every layer has to constantly re-adapt instead of learning useful features.

This effect is called internal covariate shift: the distribution of inputs to each layer changes during training as the preceding weights update. It causes:
- Slower training (smaller learning rates are needed to avoid instability)
- Difficulty with saturating activations (inputs drift into flat regions of sigmoid/tanh)
- Careful initialization requirements (poor init = immediate failure)

Normalization counters these problems and enables:
- Higher learning rates (often 10x larger)
- Faster convergence (typically 2-3x fewer epochs)
- Reduced sensitivity to initialization (networks “just work” with standard init)

The honest truth about “internal covariate shift”: The original BatchNorm paper (Ioffe and Szegedy, 2015) attributed its success to reducing internal covariate shift. Later research (Santurkar et al., 2018) showed this explanation is incomplete — BatchNorm primarily works by smoothing the loss landscape, making it easier for optimizers to navigate. The mechanism matters less than the result: normalization is one of the most impactful techniques in modern deep learning.
Batch Normalization
The most influential normalization technique, introduced in 2015. The core idea: for each feature channel, compute the mean and variance across all examples in the current mini-batch, then normalize so the channel has zero mean and unit variance. Finally, apply a learnable scale ($\gamma$) and shift ($\beta$) so the network can undo the normalization if that is optimal.

Why the learnable parameters? Without $\gamma$ and $\beta$, normalization would force every layer’s output to have zero mean and unit variance, which might be too restrictive. The learnable parameters let the network decide the “ideal” mean and variance for each channel. If the network learns $\gamma = \sqrt{\sigma_B^2 + \epsilon}$ and $\beta = \mu_B$, it has effectively undone the normalization, so BatchNorm can never hurt (in theory).

Normalize across the batch dimension:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$

Where $\mu_B$ and $\sigma_B^2$ are batch statistics, $\gamma$ and $\beta$ are learnable, and $\epsilon$ is a small constant for numerical stability.

Pitfall (small batch sizes): BatchNorm estimates mean and variance from the current mini-batch. With a batch size of 2 or 4, these estimates are extremely noisy, which destabilizes training. If your GPU memory limits you to small batches (common with high-resolution images or large models), switch to GroupNorm or LayerNorm instead. A good rule of thumb: BatchNorm works well with batch sizes of 32 or larger; below 16, consider alternatives.
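
The training-time computation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a framework implementation: the function name `batch_norm_train`, the `(batch, features)` input shape, and the `eps` value are assumptions made for the example, and it omits the running statistics that frameworks keep for inference.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm for inputs of shape (batch, features)."""
    mu = x.mean(axis=0)                      # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature (biased) variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit variance per feature
    return gamma * x_hat + beta, mu, var     # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))  # illustrative data: 64 examples, 4 features

# With gamma = 1 and beta = 0 the output is simply the normalized input.
y, mu, var = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))

# With gamma = sqrt(var + eps) and beta = mu the layer reproduces its input,
# which is the sense in which the network can "undo" the normalization.
y_identity, _, _ = batch_norm_train(x, gamma=np.sqrt(var + 1e-5), beta=mu)
```

Here `y` has approximately zero mean and unit variance in every feature column, while `y_identity` matches `x` up to floating-point error.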
Layer Normalization
The critical difference from BatchNorm: LayerNorm normalizes across the feature dimension within a single sample, completely independent of other examples in the batch. Think of it this way: BatchNorm asks “how does this feature compare across all images in the batch?” while LayerNorm asks “how does this feature compare to all other features within this one image/token?” This independence from batch size is why LayerNorm became the standard for Transformers and RNNs, where batch sizes vary and sequences have different lengths.

$$y = \gamma \, \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Where $\mu$ and $\sigma^2$ are computed per sample across all features.
Comparison of Normalization Types
| Type | Normalize Over | Best For |
|---|---|---|
| Batch Norm | Batch, H, W (per channel) | CNNs with large batches |
| Layer Norm | C, H, W (per sample) | Transformers, RNNs |
| Instance Norm | H, W (per sample, per channel) | Style transfer |
| Group Norm | Groups of channels, H, W (per sample) | Small batches |
| RMSNorm | Features (no mean) | LLMs (faster) |
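
The "Normalize Over" column in the table above maps directly onto reduction axes. The sketch below assumes an image-style tensor of shape `(N, C, H, W)` and leaves out the learnable scale and shift to keep the focus on the axes; the helper names `normalize` and `group_norm` are made up for this example.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Subtract the mean and divide by the std computed over the given axes."""
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def group_norm(x, num_groups, eps=1e-5):
    """Split channels into groups, then normalize each (sample, group) block."""
    n, c, h, w = x.shape
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    return normalize(grouped, axes=(2, 3, 4), eps=eps).reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=3.0, size=(8, 6, 5, 5))  # (N, C, H, W), illustrative sizes

batch_out = normalize(x, axes=(0, 2, 3))     # Batch Norm: one mean/var per channel
layer_out = normalize(x, axes=(1, 2, 3))     # Layer Norm: one mean/var per sample
instance_out = normalize(x, axes=(2, 3))     # Instance Norm: one per (sample, channel)
group_out = group_norm(x, num_groups=3)      # Group Norm: one per (sample, group)

# Layer Norm does not depend on the rest of the batch; Batch Norm does.
layer_is_batch_independent = np.allclose(layer_out[:1], normalize(x[:1], axes=(1, 2, 3)))
batch_is_batch_independent = np.allclose(batch_out[:1], normalize(x[:1], axes=(0, 2, 3)))
```

Only the Batch Norm call reduces over axis 0, which is why it is the one variant whose output for a sample changes when the other samples in the batch change.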
RMSNorm (Modern LLMs)
RMSNorm is a simplified variant of LayerNorm that skips the mean-centering step entirely and only divides by the root-mean-square of the activations. This saves both compute and memory (no mean calculation, no beta parameter). The empirical finding is surprising: the re-centering step in LayerNorm contributes very little to training stability; most of the benefit comes from the re-scaling alone. This is why LLaMA, Gemma, Mistral, and most modern LLMs use RMSNorm: it is approximately 10-15% faster than LayerNorm at scale, with no measurable loss in quality.
When to Use What
| Scenario | Recommendation |
|---|---|
| CNN with batch ≥ 32 | Batch Norm |
| CNN with small batch | Group Norm |
| Transformer/Attention | Layer Norm or RMSNorm |
| RNN/LSTM | Layer Norm |
| Style Transfer | Instance Norm |
| Modern LLM | RMSNorm |
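
For the RMSNorm rows in the table above, the difference from LayerNorm is small enough to show side by side. This is a sketch under assumptions: the `(batch, tokens, features)` shape, the `eps` value, and its placement inside the square root are choices made for the example, not something the text above specifies.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """LayerNorm over the last axis: re-center, re-scale, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm over the last axis: re-scale only, with no mean and no beta."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 16))   # (batch, tokens, features), illustrative sizes
gamma = np.ones(16)

out = rms_norm(x, gamma)
out_rms = np.sqrt(np.mean(out * out, axis=-1))   # close to 1 for every token

# On inputs that already have zero mean per token, the two layers coincide.
x_centered = x - x.mean(axis=-1, keepdims=True)
same_when_centered = np.allclose(
    rms_norm(x_centered, gamma),
    layer_norm(x_centered, gamma, beta=0.0),
)
```

The comparison on `x_centered` makes the relationship concrete: RMSNorm is LayerNorm with the mean subtraction and the shift parameter removed.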
Exercises
Exercise 1: BatchNorm Analysis
Train the same CNN with and without BatchNorm. Compare learning curves, final accuracy, and sensitivity to learning rate.
Exercise 2: Small Batch Study
Compare BatchNorm vs GroupNorm with batch sizes of 2, 4, 8, 16, 32.
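
As a possible warm-up before training anything, the sketch below measures how noisy the batch-mean estimate is at each of these batch sizes. It uses synthetic standard-normal values as a hypothetical stand-in for one activation channel, so it illustrates only the estimation noise, not the full BatchNorm vs GroupNorm comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the values of one activation channel.
population = rng.normal(size=100_000)

def batch_mean_spread(batch_size, trials=2000):
    """Standard deviation of the batch-mean estimate across random mini-batches."""
    batches = rng.choice(population, size=(trials, batch_size))
    return batches.mean(axis=1).std()

spread = {b: batch_mean_spread(b) for b in (2, 4, 8, 16, 32)}
for b, s in spread.items():
    print(f"batch size {b:>2}: spread of batch mean = {s:.3f}")
```

The spread shrinks roughly as one over the square root of the batch size, so the statistics BatchNorm sees at batch size 2 are about four times noisier than at batch size 32.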
Exercise 3: Pre-Norm vs Post-Norm
In transformers, compare placing LayerNorm before vs after attention/FFN blocks.
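
The two placements differ only in where the normalization sits relative to the residual connection. The sketch below is a starting point, not a transformer implementation: `sublayer` is a hypothetical stand-in for an attention or FFN block, and the learnable LayerNorm parameters are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm over the last axis, without learnable parameters."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    """Post-Norm: add the residual first, then normalize the sum."""
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    """Pre-Norm: normalize the sublayer input; the residual path is untouched."""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(16, 16))

def sublayer(h):
    # Hypothetical stand-in for an attention or FFN block.
    return np.tanh(h @ w)

x = rng.normal(size=(4, 16))   # (tokens, features), illustrative sizes
post = post_norm_block(x, sublayer)
pre = pre_norm_block(x, sublayer)
```

In the Post-Norm output every token is normalized, whereas in the Pre-Norm output the input `x` passes through the block unchanged along the residual path, which is the difference the exercise asks you to study at depth.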
What’s Next
Module 17: Regularization for Deep Networks
Dropout, weight decay, and other techniques for preventing overfitting.