Diffusion Models


The Core Idea

Diffusion models work by:
  1. Forward process: Gradually add noise to data until it becomes pure noise
  2. Reverse process: Learn to denoise step by step, recovering the original data
Think of it like this: imagine you have a pristine photograph. You photocopy it, then photocopy the photocopy, then photocopy that. Each generation adds a little more noise and blur until, after a thousand copies, you have nothing but static. That is the forward process. Now imagine you train someone to look at any noisy photocopy and predict what the previous, slightly-less-noisy version looked like. If they can do that one small step reliably, they can chain those steps together, starting from pure static and working backward, to reconstruct a crisp image that never existed in the training set. That is the reverse process.

Another way to think about it: imagine sculpting from marble. You start with a formless block (pure noise) and chip away small imperfections in each step. No single chisel stroke creates the sculpture; it emerges from hundreds of tiny, precise removals. Each denoising step is one chisel strike, and the neural network has learned where to strike next by studying thousands of finished sculptures (training images) and their partially-chipped states.

Why this idea is so powerful: unlike GANs, which require a delicate adversarial dance between two networks, diffusion models optimize a single, stable denoising objective. The training loss is just mean-squared error on predicted noise, as boring and well-understood as it gets. The magic emerges from chaining many small, easy denoising steps into one large, creative generation process.
A senior engineer would frame it this way: “Diffusion models decompose one impossibly hard problem — generate a realistic image from nothing — into a thousand easy problems: remove a tiny bit of noise. Each sub-problem is a simple regression task. The genius is in the decomposition, not the network architecture.”
Diffusion Process

Mathematical Foundation

Forward Diffusion (Adding Noise)

At each step $t$, we add a small amount of Gaussian noise:

$$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)$$

Where $\beta_t$ is the noise schedule, a small positive number (typically between 0.0001 and 0.02) that controls how much noise is added at step $t$. The $\sqrt{1-\beta_t}$ factor slightly shrinks the signal while $\beta_t$ controls the noise variance. Over many steps, the signal is completely destroyed.

The key mathematical trick: we do not need to run all $T$ steps sequentially. We can jump directly to any timestep $t$ in closed form:

$$q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t) I)$$

Where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ (the cumulative product of all alphas up to step $t$).

Intuition: $\bar{\alpha}_t$ decays from nearly 1 (almost clean image) to nearly 0 (almost pure noise). So $x_t = \sqrt{\bar{\alpha}_t} \cdot x_0 + \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon$ is just a weighted blend of the original image and random noise. Early timesteps are mostly signal; late timesteps are mostly noise.
import torch
import torch.nn as nn
import numpy as np

class DiffusionSchedule:
    """Noise schedule for diffusion process."""
    
    def __init__(self, timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.timesteps = timesteps
        
        # Linear schedule: betas grow linearly from near-zero to 0.02
        # This means early steps add very little noise, later steps add more
        self.betas = torch.linspace(beta_start, beta_end, timesteps)
        self.alphas = 1.0 - self.betas
        
        # Cumulative product: this is the key quantity that lets us
        # jump to any timestep directly without iterating
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)
        
        # Precompute coefficients for the closed-form noising formula:
        # x_t = sqrt(alpha_cumprod_t) * x_0 + sqrt(1 - alpha_cumprod_t) * noise
        self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)
        self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod)
    
    def add_noise(self, x_0, t, noise=None):
        """
        Add noise to x_0 at timestep t using the closed-form formula.
        This is the 'reparameterization trick' applied to diffusion --
        we can sample x_t directly without running t sequential steps.
        """
        if noise is None:
            noise = torch.randn_like(x_0)
        
        # Reshape for broadcasting across image dimensions (B, C, H, W)
        sqrt_alpha = self.sqrt_alphas_cumprod[t].view(-1, 1, 1, 1)
        sqrt_one_minus_alpha = self.sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)
        
        # Weighted sum: original signal + noise
        return sqrt_alpha * x_0 + sqrt_one_minus_alpha * noise
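
The closed-form jump can be sanity-checked without any tensors. The following is a minimal sketch in plain Python, assuming the same linear schedule defaults as `DiffusionSchedule`; the helper names `linear_betas` and `forward_stats` are hypothetical and exist only for this illustration. It tracks the mean coefficient and variance of $x_t$ given $x_0$ in two ways, once by applying the one-step rule repeatedly and once via $\bar{\alpha}_t$, and the two should agree up to floating-point error.

```python
import math

def linear_betas(timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear schedule with the same defaults as DiffusionSchedule (assumed)."""
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]

def forward_stats(betas, t):
    """Mean coefficient and variance of x_t given x_0, computed two ways.

    Uses a 0-based t, matching the indexing of alphas_cumprod[t].
    Returns (closed_form, iterated), each a (mean_coeff, variance) pair.
    """
    alpha_bar = 1.0
    mean_coeff, variance = 1.0, 0.0
    for beta in betas[: t + 1]:
        # Closed form: accumulate the product of alphas.
        alpha_bar *= 1.0 - beta
        # Iterated: apply q(x_t | x_{t-1}) one step at a time.
        mean_coeff *= math.sqrt(1.0 - beta)
        variance = (1.0 - beta) * variance + beta
    closed_form = (math.sqrt(alpha_bar), 1.0 - alpha_bar)
    iterated = (mean_coeff, variance)
    return closed_form, iterated

betas = linear_betas()
for t in (0, 250, 500, 999):
    closed, iterated = forward_stats(betas, t)
    print(f"t={t:4d}  closed={closed}  iterated={iterated}")
```

At the last timestep the signal weight is close to zero and the noise variance is close to one, which is what lets sampling start from a standard Gaussian.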

Reverse Process (Learning to Denoise)

We train a neural network $\epsilon_\theta$ to predict the noise that was added at step $t$:

$$\mathcal{L} = \mathbb{E}_{x_0, t, \epsilon} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]$$

Read this carefully: the loss is beautifully simple. We take a clean image $x_0$, pick a random timestep $t$, add known noise $\epsilon$ to get $x_t$, then ask the network “what noise was added?” The loss is just MSE between the actual noise and the predicted noise. No adversarial training, no complex objectives, just noise prediction.

Why predict noise instead of the clean image? Empirically, noise prediction gives more stable gradients. Intuitively, predicting noise is a “residual” task: the network only needs to learn what was added, not reconstruct the entire image from scratch. This is the same insight that makes ResNets work.
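Predicting noise and predicting the clean image are two views of the same quantity, because the closed-form noising formula can be inverted. The sketch below illustrates this on a single scalar "pixel" in plain Python; the function names are hypothetical and the values are arbitrary.

```python
import math

def noisy_sample(x0, eps, alpha_bar_t):
    """Closed-form forward process: blend the signal x0 with the noise eps."""
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps

def x0_from_noise(x_t, eps_pred, alpha_bar_t):
    """Invert the forward formula to turn a noise estimate into an x0 estimate."""
    return (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)

def noise_loss(eps, eps_pred):
    """Squared error on the noise, the per-element term of the training loss."""
    return (eps - eps_pred) ** 2

x0, eps, alpha_bar_t = 0.8, -1.3, 0.5
x_t = noisy_sample(x0, eps, alpha_bar_t)

# A perfect noise prediction recovers x0 and has zero loss.
perfect = x0_from_noise(x_t, eps, alpha_bar_t)
# A noise prediction that is off by 0.1 shifts the x0 estimate.
off = x0_from_noise(x_t, eps + 0.1, alpha_bar_t)

print(perfect, noise_loss(eps, eps))
print(off, noise_loss(eps, eps + 0.1))
```

An error in the predicted noise is scaled by $\sqrt{1-\bar{\alpha}_t} / \sqrt{\bar{\alpha}_t}$ when converted to an error in $x_0$, so the same noise error matters much more at late (very noisy) timesteps than at early ones.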
class SimpleDiffusion(nn.Module):
    """Simple U-Net style denoiser."""
    
    def __init__(self, channels=1, time_emb_dim=32):
        super().__init__()
        
        # Time embedding
        self.time_mlp = nn.Sequential(
            nn.Linear(1, time_emb_dim),
            nn.GELU(),
            nn.Linear(time_emb_dim, time_emb_dim),
        )
        
        # Encoder
        self.enc1 = nn.Conv2d(channels, 64, 3, padding=1)
        self.enc2 = nn.Conv2d(64, 128, 3, stride=2, padding=1)
        self.enc3 = nn.Conv2d(128, 256, 3, stride=2, padding=1)
        
        # Decoder
        self.dec3 = nn.ConvTranspose2d(256 + time_emb_dim, 128, 4, stride=2, padding=1)
        self.dec2 = nn.ConvTranspose2d(256, 64, 4, stride=2, padding=1)
        self.dec1 = nn.Conv2d(128, channels, 3, padding=1)
    
    def forward(self, x, t):
        # Time embedding
        t_emb = self.time_mlp(t.float().unsqueeze(-1) / 1000)
        
        # Encode
        e1 = torch.relu(self.enc1(x))
        e2 = torch.relu(self.enc2(e1))
        e3 = torch.relu(self.enc3(e2))
        
        # Add time embedding
        t_emb = t_emb.view(t_emb.size(0), -1, 1, 1).expand(-1, -1, e3.size(2), e3.size(3))
        e3 = torch.cat([e3, t_emb], dim=1)
        
        # Decode with skip connections
        d3 = torch.relu(self.dec3(e3))
        d2 = torch.relu(self.dec2(torch.cat([d3, e2], dim=1)))
        d1 = self.dec1(torch.cat([d2, e1], dim=1))
        
        return d1

Training Loop

The training algorithm is refreshingly simple compared to GANs — no adversarial balancing, no mode collapse to worry about. Each iteration samples a random timestep, corrupts a clean image to that noise level, and asks the network to predict what noise was added.
def train_diffusion(model, dataloader, schedule, epochs=10):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    
    for epoch in range(epochs):
        for batch in dataloader:
            x_0 = batch[0]  # Clean images from the dataset
            batch_size = x_0.size(0)
            
            # Sample a random timestep for each image in the batch.
            # This is crucial: each image gets a DIFFERENT noise level,
            # so the network learns to denoise at every stage.
            t = torch.randint(0, schedule.timesteps, (batch_size,))
            
            # Generate the target noise and create the noisy version
            noise = torch.randn_like(x_0)
            x_t = schedule.add_noise(x_0, t, noise)
            
            # Ask the network: "given this noisy image at timestep t,
            # what noise was added?" This is a simple regression task.
            predicted_noise = model(x_t, t)
            
            # MSE between actual noise and predicted noise.
            # That is the entire loss function -- no adversarial terms,
            # no KL divergence, no reconstruction loss. Just MSE.
            loss = nn.MSELoss()(predicted_noise, noise)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
Training tip: Uniform timestep sampling (as shown above) works, but importance sampling, where you sample more often from timesteps with higher loss, can reportedly speed up convergence by 20-40%. A related idea is to reweight the loss rather than the sampling distribution: the P2 weighting paper showed that emphasizing middle timesteps (where the network learns the most useful structure) significantly improves sample quality.
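
To make the importance-sampling idea concrete, here is a rough sketch of a loss-aware timestep sampler in plain Python. The class `LossAwareTimestepSampler` and its parameters are hypothetical, not part of the training loop above or of any library. It samples timesteps in proportion to the root-mean-square of their recent losses and returns weights that keep the expected loss comparable to uniform sampling. It is not an implementation of P2 weighting.

```python
import random

class LossAwareTimestepSampler:
    """Hypothetical helper: favor timesteps whose recent loss is high."""

    def __init__(self, timesteps, history=10, seed=0):
        self.timesteps = timesteps
        self.history = history
        self.losses = [[] for _ in range(timesteps)]
        self.rng = random.Random(seed)

    def probabilities(self):
        uniform = [1.0 / self.timesteps] * self.timesteps
        # Warm-up: stay uniform until every timestep has a full loss history.
        if any(len(h) < self.history for h in self.losses):
            return uniform
        weights = [(sum(v * v for v in h) / len(h)) ** 0.5 for h in self.losses]
        total = sum(weights)
        if total == 0.0:
            return uniform
        return [w / total for w in weights]

    def sample(self, batch_size):
        """Return sampled timesteps and per-sample loss weights."""
        probs = self.probabilities()
        ts = self.rng.choices(range(self.timesteps), weights=probs, k=batch_size)
        # Importance weights 1 / (T * p_t) undo the bias of non-uniform sampling.
        loss_weights = [1.0 / (self.timesteps * probs[t]) for t in ts]
        return ts, loss_weights

    def update(self, ts, losses):
        """Record the per-sample losses observed for the sampled timesteps."""
        for t, loss in zip(ts, losses):
            h = self.losses[t]
            h.append(float(loss))
            if len(h) > self.history:
                h.pop(0)

sampler = LossAwareTimestepSampler(timesteps=4, history=2)
for _ in range(2):
    sampler.update([0, 1, 2, 3], [0.1, 0.1, 1.0, 0.1])
print(sampler.probabilities())
```

In a training loop, the sampled timesteps would replace the uniform `torch.randint` draw, the per-sample MSE would be multiplied by the returned weights before averaging, and `update` would be called with the unweighted per-sample losses.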

Sampling (Generation)

Sampling is the reverse of the forward process: start from pure Gaussian noise and iteratively denoise. Think of it like developing a Polaroid photo — the image gradually emerges from chaos, with coarse structure appearing first (large-scale shapes and colors) and fine details (textures, edges) filling in during the last steps.
@torch.no_grad()  # No gradients needed -- we are only generating, not training
def sample(model, schedule, shape, device='cpu'):
    """Generate samples by reverse diffusion."""
    # Start from pure noise -- this is our 'blank canvas'
    x = torch.randn(shape, device=device)
    
    # Walk backward from t=T (pure noise) to t=0 (clean image)
    for t in reversed(range(schedule.timesteps)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long, device=device)
        
        # Ask the trained network: "what noise is in this image?"
        predicted_noise = model(x, t_batch)
        
        # Retrieve the precomputed schedule constants for this timestep
        alpha = schedule.alphas[t]
        alpha_cumprod = schedule.alphas_cumprod[t]
        beta = schedule.betas[t]
        
        # Add stochasticity at every step EXCEPT the final one.
        # At t=0 we want a deterministic, clean output.
        if t > 0:
            noise = torch.randn_like(x)
        else:
            noise = 0
        
        # The DDPM update rule: remove the predicted noise component,
        # then add a small amount of fresh noise for stochasticity.
        # The (1/sqrt(alpha)) factor rescales the signal back up,
        # and the (beta/sqrt(1-alpha_cumprod)) factor controls
        # how much of the predicted noise to subtract.
        x = (1 / torch.sqrt(alpha)) * (
            x - (beta / torch.sqrt(1 - alpha_cumprod)) * predicted_noise
        ) + torch.sqrt(beta) * noise
    
    return x
Pitfall — sampling speed: The standard DDPM sampler requires 1000 forward passes through the U-Net to generate a single image. For a model with 860M parameters (Stable Diffusion’s U-Net), that is painfully slow. In practice, use accelerated samplers like DDIM (which can skip steps, reducing to 20-50 steps with minimal quality loss) or DPM-Solver++ (which treats denoising as an ODE and uses higher-order solvers). A senior engineer would never ship a product with the naive 1000-step sampler.
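
As a rough illustration of step skipping, the sketch below implements the deterministic DDIM update (the variant with no added noise) over an evenly spaced subset of timesteps. It is written in plain Python on scalars so it runs anywhere. `ddim_timesteps`, `ddim_sample`, and the `oracle_eps` stand-in for a trained network are hypothetical names for this example, and the schedule is assumed to be the linear one used earlier.

```python
import math

def ddim_timesteps(train_steps, num_steps):
    """Evenly spaced subset of the training timesteps, from high to low."""
    if not 2 <= num_steps <= train_steps:
        raise ValueError("num_steps must be between 2 and train_steps")
    stride = (train_steps - 1) / (num_steps - 1)
    return [round(i * stride) for i in reversed(range(num_steps))]

def ddim_sample(eps_model, alphas_cumprod, x, num_steps=50):
    """Deterministic DDIM sampling: estimate x0, then re-noise to the next level."""
    steps = ddim_timesteps(len(alphas_cumprod), num_steps)
    for i, t in enumerate(steps):
        ab_t = alphas_cumprod[t]
        # After the last step there is no noise left, so alpha_bar is 1.
        ab_prev = alphas_cumprod[steps[i + 1]] if i + 1 < len(steps) else 1.0
        eps = eps_model(x, t)
        x0_pred = (x - math.sqrt(1.0 - ab_t) * eps) / math.sqrt(ab_t)
        x = math.sqrt(ab_prev) * x0_pred + math.sqrt(1.0 - ab_prev) * eps
    return x

# Linear schedule, assumed to match the training schedule.
alphas_cumprod, running = [], 1.0
for i in range(1000):
    running *= 1.0 - (1e-4 + i * (0.02 - 1e-4) / 999)
    alphas_cumprod.append(running)

target = 0.7  # the "image" a perfect model would denoise toward

def oracle_eps(x, t):
    """Stand-in for a trained network: returns the exact noise for `target`."""
    ab = alphas_cumprod[t]
    return (x - math.sqrt(ab) * target) / math.sqrt(1.0 - ab)

print(ddim_sample(oracle_eps, alphas_cumprod, x=1.9, num_steps=20))
```

With a perfect noise predictor, 20 steps land on the target just as 1000 would, which is the intuition behind skipping steps. A real network is imperfect, so quality still depends on the step count. Because the schedule constants are plain floats, the same arithmetic should carry over when `x` and the model output are tensors.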

Classifier-Free Guidance

Classifier-free guidance (CFG) is the mechanism that lets you steer generation with a text prompt (or class label). The intuition is surprisingly elegant: during training, the model randomly drops the conditioning signal some fraction of the time (say 10%), so it learns both conditional and unconditional generation. At inference, you run the model twice, once with your prompt and once without, and amplify the difference.

Think of it like asking for directions. The unconditional prediction says “go vaguely north.” The conditional prediction (with your prompt) says “go northeast toward the bakery.” Guidance amplifies the difference: “go VERY northeast toward the bakery.” Higher guidance scale means stronger adherence to the prompt, at the cost of reduced diversity.

$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \emptyset) + s \cdot (\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \emptyset))$$

Where $\tilde{\epsilon}_\theta$ is the guided prediction that the sampler uses in place of the raw network output, and $s > 1$ is the guidance scale (typically 7.5 for Stable Diffusion). At $s = 1$ you get the raw conditional model; at $s = 0$ you get unconditional generation. Values above 7-8 tend to produce over-saturated, artifact-heavy images. A common beginner mistake is cranking guidance to 20+ and wondering why the outputs look “deep-fried.”
Practical guidance scale ranges: For photorealistic images, 5-8 works well. For artistic/stylized outputs, 3-5 can give better variety. For inpainting tasks, lower values (2-4) often produce more natural blending with the surrounding context.
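
Both halves of classifier-free guidance fit in a few lines. The sketch below uses plain Python lists in place of tensors. `NULL_LABEL`, `drop_condition`, and `guided_noise` are hypothetical names for this example, and the 10% drop rate is the illustrative value mentioned above, not a recommendation.

```python
import random

NULL_LABEL = None  # hypothetical placeholder for "no conditioning"

def drop_condition(label, p_uncond=0.1, rng=random):
    """Training time: replace the label with the null label some of the time."""
    return NULL_LABEL if rng.random() < p_uncond else label

def guided_noise(eps_uncond, eps_cond, scale):
    """Inference time: push the unconditional prediction toward the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 0.5]

print(guided_noise(eps_uncond, eps_cond, 0.0))  # unconditional prediction
print(guided_noise(eps_uncond, eps_cond, 1.0))  # conditional prediction
print(guided_noise(eps_uncond, eps_cond, 2.0))  # extrapolated past the conditional one
```

In a real sampler the two predictions come from two forward passes of the same network (often batched together), and the guided result replaces `predicted_noise` in the update rule.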

Connection to Stable Diffusion

Running diffusion directly on 512x512 pixel images is absurdly expensive — the U-Net would need to process 786,432 values per image at every timestep. Stable Diffusion’s key insight is to run the entire diffusion process in a compressed latent space instead. This is like editing a blueprint instead of rebuilding the house for every revision. Stable Diffusion operates in latent space for efficiency:
  1. VAE Encoder: Compress a 512x512 image to a 64x64 latent (8x per side, so 64x fewer spatial positions)
  2. U-Net: Denoise in latent space (a 64x64 latent with 4 channels holds 16,384 values instead of 786,432, roughly 48x fewer)
  3. VAE Decoder: Expand latent back to full-resolution image
  4. CLIP Text Encoder: Convert text prompts into conditioning vectors that guide the denoising
This architecture, called a Latent Diffusion Model (LDM), is what made high-resolution image generation practical on consumer GPUs. Training and sampling pixel-space diffusion models, such as the one behind DALL-E 2, is far more compute-intensive; Stable Diffusion’s latent approach brought that cost down dramatically.
Pitfall — VAE quality ceiling: Because the final image must pass through the VAE decoder, the VAE’s reconstruction quality puts a hard ceiling on output fidelity. Fine details that the VAE cannot reconstruct will never appear in generated images, no matter how good your diffusion model is. This is why newer versions of Stable Diffusion ship with improved VAE decoders.
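
The size savings from working in latent space can be checked with a few lines of arithmetic. This sketch assumes an RGB input and the 4-channel latent used by Stable Diffusion v1; the channel counts are assumptions of the example, and `num_values` is a hypothetical helper.

```python
def num_values(height, width, channels):
    """Number of scalar values in an image-like tensor."""
    return height * width * channels

pixel_values = num_values(512, 512, 3)      # RGB image
latent_values = num_values(64, 64, 4)       # 4-channel latent (assumed)

spatial_ratio = (512 * 512) // (64 * 64)    # reduction in spatial positions
value_ratio = pixel_values / latent_values  # reduction in total values

print(pixel_values, latent_values, spatial_ratio, value_ratio)
# prints: 786432 16384 64 48.0
```

The number of values is only a proxy for cost. The actual savings depend on the U-Net architecture, and attention layers in particular scale worse than linearly with the number of spatial positions.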

Exercises

Train a diffusion model on MNIST. Generate digit samples and visualize the denoising process.
Implement and compare linear, cosine, and quadratic noise schedules.
Add class conditioning to generate specific digits.
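
As a possible starting point for the schedule comparison, here is a sketch of a cosine schedule in plain Python. It defines $\bar{\alpha}_t$ through a squared-cosine curve and derives the betas from consecutive ratios. The offset `s=0.008` and the clipping value `max_beta=0.999` are commonly used defaults and are treated as assumptions here, as is the function name `cosine_betas`.

```python
import math

def cosine_betas(timesteps=1000, s=0.008, max_beta=0.999):
    """Betas derived from a squared-cosine alpha_bar curve (a sketch)."""
    def f(t):
        # Unnormalized alpha_bar at (possibly fractional) progress t / timesteps.
        return math.cos((t / timesteps + s) / (1 + s) * math.pi / 2) ** 2

    betas = []
    for t in range(timesteps):
        # beta_t = 1 - alpha_bar_{t+1} / alpha_bar_t, clipped for stability
        beta = 1.0 - f(t + 1) / f(t)
        betas.append(min(beta, max_beta))
    return betas

betas = cosine_betas()
alpha_bar, checkpoints = 1.0, {}
for t, beta in enumerate(betas):
    alpha_bar *= 1.0 - beta
    if t in (0, 250, 500, 750, 999):
        checkpoints[t] = alpha_bar
print(checkpoints)
```

The resulting list can be wrapped in `torch.tensor(...)` and swapped in for `torch.linspace(...)` inside `DiffusionSchedule` to compare it against the linear schedule.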

What’s Next

Module 15: Residual & Skip Connections

Learn how to train very deep networks with identity mappings.