Generative Adversarial Networks
The Counterfeiter vs Detective Game
Imagine a world with two players locked in an eternal battle:
- The Counterfeiter (Generator): Creates fake money, trying to make it indistinguishable from real currency
- The Detective (Discriminator): Examines bills and tries to identify which are real and which are fake
GAN Architecture Overview
| Component | Role | Input | Output |
|---|---|---|---|
| Generator (G) | Creates fake data | Random noise | Fake sample |
| Discriminator (D) | Classifies real/fake | Sample | Probability |
The Minimax Loss Function
The GAN training objective is a minimax game:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
Intuition behind the math: The Discriminator is playing a classification game: it wants to output 1 for real samples and 0 for fakes. The logarithm amplifies mistakes: $\log D(x)$ punishes D heavily when it assigns low probability to a real sample (e.g., $\log 0.01 \approx -4.6$), but barely rewards it for being right ($\log 0.99 \approx -0.01$). This asymmetric penalty is what drives both networks to improve rapidly.
| Term | Meaning | What It Does |
|---|---|---|
| $\mathbb{E}_{x \sim p_{data}}[\log D(x)]$ | Expected log probability for real data | D wants this HIGH (correctly identify real) |
| $\mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$ | Expected log probability for fake data | D wants this HIGH, G wants this LOW |
- Correctly classify real images → $D(x) \to 1$
- Correctly classify fake images → $D(G(z)) \to 0$
- Fool the discriminator → $D(G(z)) \to 1$
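To make the objective concrete, here is a minimal sketch that estimates the value function from batches of discriminator outputs. The function name `gan_value` and the sample probabilities are illustrative assumptions, not part of any library.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs.

    d_real: D(x) for a batch of real samples, each in (0, 1)
    d_fake: D(G(z)) for a batch of generated samples, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A confident, correct discriminator pushes V toward its maximum of 0.
strong_d = gan_value(d_real=[0.99, 0.98], d_fake=[0.01, 0.02])

# A discriminator that cannot tell real from fake outputs 0.5 everywhere,
# which gives V = 2 * log(0.5) = -log(4).
confused_d = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

print(strong_d, confused_d)
```

The Discriminator tries to move toward the first case, while the Generator tries to drag the value back down by making its samples harder to reject.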
Complete GAN Training Loop
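The alternating update at the heart of any GAN training loop can be illustrated on a toy 1-D problem. The sketch below is an assumption-laden stand-in, not a full image model: the generator only learns an offset `b`, the discriminator is a single logistic unit, gradients are written out by hand, and the generator uses the non-saturating objective.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_step(params, real_batch, noise_batch, lr=0.05):
    """One alternating update for a toy 1-D GAN (hypothetical setup).

    Generator:      G(z) = z + b              (learns only an offset b)
    Discriminator:  D(x) = sigmoid(w * x + c)
    """
    w, c, b = params["w"], params["c"], params["b"]
    fake_batch = [z + b for z in noise_batch]

    # Discriminator step: gradient ascent on log D(x) + log(1 - D(G(z))).
    grad_w = grad_c = 0.0
    for x in real_batch:   # d/da log(sigmoid(a)) = 1 - sigmoid(a)
        err = 1.0 - sigmoid(w * x + c)
        grad_w += err * x
        grad_c += err
    for x in fake_batch:   # d/da log(1 - sigmoid(a)) = -sigmoid(a)
        err = -sigmoid(w * x + c)
        grad_w += err * x
        grad_c += err
    n = len(real_batch) + len(fake_batch)
    w += lr * grad_w / n
    c += lr * grad_c / n

    # Generator step: gradient ascent on log D(G(z)) using the updated D.
    # d/db log(sigmoid(w * (z + b) + c)) = (1 - D(G(z))) * w
    grad_b = sum((1.0 - sigmoid(w * x + c)) * w for x in fake_batch)
    b += lr * grad_b / len(fake_batch)

    return {"w": w, "c": c, "b": b}

random.seed(0)
params = {"w": 0.0, "c": 0.0, "b": 0.0}
for _ in range(200):
    real = [random.gauss(4.0, 1.0) for _ in range(32)]   # real data ~ N(4, 1)
    noise = [random.gauss(0.0, 1.0) for _ in range(32)]  # z ~ N(0, 1)
    params = train_step(params, real, noise)
print(params)
```

The structure carries over to image models: update D on a real batch and a fake batch, then update G through the freshly updated D.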
Let’s train a GAN on MNIST digits:
Mode Collapse: The GAN’s Achilles Heel
Why Does Mode Collapse Happen?
| Cause | Explanation |
|---|---|
| Easy shortcut | Generator finds one mode that consistently fools D |
| Discriminator forgetting | D gets fooled by one mode, G exploits it |
| Training imbalance | G or D becomes too strong too quickly |
DCGAN: Deep Convolutional GAN
| Guideline | Generator | Discriminator |
|---|---|---|
| Pooling | None; upsample with transposed conv | None; downsample with strided conv |
| Normalization | BatchNorm (except output) | BatchNorm (except input) |
| Activation | ReLU (except output: Tanh) | LeakyReLU |
| Architecture | Fully convolutional | Fully convolutional |
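The resolution arithmetic behind these guidelines can be checked with the standard output-size formulas for strided and transposed convolutions. The kernel size 4, stride 2, and padding 1 used here are a common choice in DCGAN implementations and are assumed for illustration.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a strided convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out(size, kernel, stride, padding):
    """Output size of a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

# Generator: start from a 4x4 feature map and double the resolution per layer.
g_sizes = [4]
for _ in range(4):
    g_sizes.append(conv_transpose_out(g_sizes[-1], kernel=4, stride=2, padding=1))

# Discriminator: halve the resolution per layer.
d_sizes = [64]
for _ in range(4):
    d_sizes.append(conv_out(d_sizes[-1], kernel=4, stride=2, padding=1))

print(g_sizes)  # [4, 8, 16, 32, 64]
print(d_sizes)  # [64, 32, 16, 8, 4]
```

With these settings the Generator and Discriminator are mirror images of each other, which is why no pooling layers are needed.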
Wasserstein GAN (WGAN)
Why Wasserstein Distance?
The analogy: Imagine you have two piles of sand (the real and generated distributions) and you want to measure how different they are. The Wasserstein distance measures the minimum amount of “work” (mass times distance) needed to reshape one pile into the other. Unlike JS divergence, which jumps from 0 to its maximum of $\log 2$ as soon as the distributions stop overlapping, the Wasserstein distance changes smoothly, giving the generator useful gradient signal even when the discriminator can perfectly separate real from fake.
| Advantage | Explanation |
|---|---|
| Continuous gradients | Provides gradients even when distributions don’t overlap |
| Meaningful loss | Loss correlates with image quality |
| Stable training | Less need for a careful balance between G and D |
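The difference between the two measures shows up in a small 1-D experiment. This sketch assumes equal-size empirical samples, for which the Wasserstein-1 distance reduces to the mean gap between sorted values, and it represents non-overlapping distributions as mass in two separate histogram bins.

```python
import math

def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

real = [0.0, 0.0, 0.0, 0.0]
for shift in (4.0, 2.0, 1.0):
    fake = [x + shift for x in real]
    # The supports never overlap, so the histograms are always disjoint:
    # all real mass in bin 0, all fake mass in bin 1, whatever the shift.
    js = js_divergence([1.0, 0.0], [0.0, 1.0])
    print(shift, wasserstein_1d(real, fake), js)
```

As the fake samples move closer to the real ones, the Wasserstein distance shrinks in step with the shift, while the JS divergence stays pinned at $\log 2$ and gives the generator nothing to follow.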
Conditional GANs (cGAN)
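One common way to condition a GAN on class labels is to concatenate a one-hot label vector with the noise vector before it enters the Generator, and to feed the same label to the Discriminator alongside the sample. The sketch below shows only the input construction; the helper names and dimensions are illustrative assumptions.

```python
def one_hot(label, num_classes):
    """Return a one-hot list with a 1.0 at index `label`."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def generator_input(z, label, num_classes):
    """Concatenate noise with a one-hot label so G can condition on the class."""
    return list(z) + one_hot(label, num_classes)

z = [0.3, -1.2, 0.7]  # hypothetical 3-D noise vector
g_in = generator_input(z, label=7, num_classes=10)
print(len(g_in), g_in)
```

Because the label is part of the input, the same noise vector can be paired with different labels to request different classes at generation time.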
GAN Evaluation Metrics
| Metric | Measures | Formula/Approach |
|---|---|---|
| Inception Score (IS) | Quality + Diversity | $\exp(\mathbb{E}_x[D_{KL}(p(y \mid x) \,\Vert\, p(y))])$ |
| FID (Fréchet Inception Distance) | Feature similarity | Compare Gaussian fits in feature space |
| LPIPS | Perceptual similarity | Distance in learned feature space |
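The Inception Score formula in the table can be computed directly from a matrix of class probabilities. In practice those probabilities come from a pretrained InceptionV3 classifier; the hand-written rows below are stand-ins, and `inception_score` is an illustrative helper rather than a library function.

```python
import math

def inception_score(preds):
    """IS = exp(mean over x of KL(p(y|x) || p(y))) for rows of class probabilities."""
    n, k = len(preds), len(preds[0])
    marginal = [sum(row[j] for row in preds) / n for j in range(k)]
    mean_kl = 0.0
    for row in preds:
        mean_kl += sum(p * math.log(p / marginal[j])
                       for j, p in enumerate(row) if p > 0)
    return math.exp(mean_kl / n)

# Confident predictions spread evenly over 3 classes: IS approaches 3.
confident_diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Unsure predictions: every row equals the marginal, so IS is 1.
unsure = [[1 / 3, 1 / 3, 1 / 3]] * 3
# Confident but collapsed onto one class: the marginal is peaked, so IS is 1.
collapsed = [[1.0, 0.0, 0.0]] * 3

for name, preds in [("diverse", confident_diverse), ("unsure", unsure), ("collapsed", collapsed)]:
    print(name, inception_score(preds))
```

The score ranges from 1 up to the number of classes, and it only rewards samples that are both confidently classified and spread across classes.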
Exercises
Exercise 1: Implement Label Smoothing
Exercise 2: Build a Progressive Training Schedule
Exercise 3: Implement Spectral Normalization
Key Takeaways
- ✅ GAN Fundamentals - Generator creates, Discriminator classifies in a minimax game
- ✅ Minimax Loss - Adversarial training objective and its components
- ✅ Mode Collapse - Common failure mode and solutions (minibatch discrimination, feature matching)
- ✅ DCGAN - Convolutional architecture guidelines for stable training
- ✅ WGAN - Wasserstein distance for improved training stability
- ✅ Conditional GANs - Control generation with class labels or other conditioning
- ✅ Evaluation - FID, Inception Score, and diversity metrics
Training Tips from the Trenches
Interview Deep-Dive
Explain the GAN minimax objective. Why does the original formulation suffer from vanishing gradients, and how does the non-saturating loss fix it?
- The minimax objective is $\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$. The Discriminator maximizes this expression (correctly classifying real and fake), while the Generator minimizes it (fooling D).
- Vanishing gradient problem: Early in training, the Generator produces obviously fake outputs. The Discriminator quickly learns to reject them with $D(G(z)) \approx 0$. The Generator’s gradient comes from $\log(1 - D(G(z)))$, which is $\approx \log(1) = 0$; because D’s sigmoid output is saturated there, the gradient that reaches G is essentially zero. The Generator receives almost no learning signal precisely when it needs the most guidance.
- Non-saturating fix: Instead of minimizing $\log(1 - D(G(z)))$, the Generator maximizes $\log D(G(z))$. When $D(G(z)) \approx 0$, this gives $\log D(G(z)) \to -\infty$, a steep region of the curve that offsets the sigmoid’s saturation and produces a large gradient. The Generator now receives strong signal to improve even when the Discriminator easily rejects its outputs.
- A senior engineer would note: the non-saturating loss changes the optimization landscape but doesn’t change the theoretical equilibrium point. Both formulations converge to $D(x) = 0.5$ (that is, $p_g = p_{data}$) at the Nash equilibrium. The practical difference is entirely about gradient magnitude during early training.
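The gap between the two generator losses can be checked numerically. This sketch assumes the Discriminator ends in a sigmoid, $D = \sigma(a)$, and measures each loss's gradient with respect to the logit $a$, which is the signal that flows back toward the Generator.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_grad(logit):
    """d/d(logit) of log(1 - D), with D = sigmoid(logit). G minimizes this loss."""
    return -sigmoid(logit)

def non_saturating_grad(logit):
    """d/d(logit) of -log(D), with D = sigmoid(logit). G minimizes this loss."""
    return -(1.0 - sigmoid(logit))

for logit in (-8.0, -4.0, 0.0):
    d = sigmoid(logit)
    print(f"D(G(z))={d:.4f}  saturating={saturating_grad(logit):+.4f}  "
          f"non-saturating={non_saturating_grad(logit):+.4f}")
```

When the Discriminator confidently rejects a sample (a very negative logit), the saturating gradient is close to zero while the non-saturating gradient has magnitude close to 1. The two coincide at $D = 0.5$.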
What is mode collapse in GANs? Describe three different techniques to mitigate it and explain the trade-offs of each.
- Mode collapse is when the Generator maps many different noise vectors to a small set of outputs, producing limited diversity. In the extreme case (“complete collapse”), every input produces the same image. Partial collapse is more common: a GAN trained on MNIST might only generate digits 1, 7, and 9, ignoring the other seven classes.
- Technique 1: Minibatch Discrimination. The Discriminator receives additional features computed across the batch, allowing it to detect when all generated samples look too similar. Trade-off: adds computational overhead and couples predictions within a batch, which complicates distributed training.
- Technique 2: Wasserstein loss (WGAN-GP). By replacing the JS divergence with the Wasserstein distance, the critic provides meaningful gradients even when distributions don’t overlap, reducing the “shortcut incentive” that causes collapse. Trade-off: requires training the critic for multiple steps per generator step (typically 5), increasing wall-clock time by 3-5x. Also requires removing BatchNorm from the critic when using gradient penalty.
- Technique 3: Unrolled GANs. The Generator anticipates future Discriminator updates by unrolling K steps of D’s optimization. This prevents G from over-exploiting D’s current weaknesses. Trade-off: memory-intensive (must store K computation graphs) and adds significant complexity. Rarely used in production — more of a research technique.
- A senior engineer would add: in practice, monitoring for mode collapse matters more than any single technique. Track the diversity of generated samples using FID, the number of distinct modes in generated class distributions, or simply visual inspection grids. If collapse is detected, the most pragmatic fix is often reducing the learning rate of G relative to D, or switching to a progressive training schedule.
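Counting distinct modes, as suggested above, can be sketched as a simple coverage check. It assumes a pretrained classifier has already labeled each generated sample; the label counts below are made up to mirror the MNIST example, and `mode_coverage` with its `min_share` threshold is a hypothetical helper.

```python
from collections import Counter

def mode_coverage(predicted_labels, num_classes, min_share=0.01):
    """Return the classes that make up at least `min_share` of generated samples."""
    counts = Counter(predicted_labels)
    total = len(predicted_labels)
    return sorted(c for c in range(num_classes)
                  if counts.get(c, 0) / total >= min_share)

# Hypothetical classifier predictions on 1,000 generated MNIST digits.
labels = [1] * 400 + [7] * 350 + [9] * 245 + [3] * 5
covered = mode_coverage(labels, num_classes=10)
print(covered, f"{len(covered)}/10 modes covered")
```

Logging this count every few thousand iterations gives an early warning: a coverage figure that drops over time is a sign of partial collapse well before it is obvious in sample grids.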
Compare WGAN-GP, spectral normalization, and the original GAN loss. When would you choose each?
- Original GAN loss (BCE): Simple to implement, works well with DCGAN architecture and careful hyperparameter tuning. Choose this when you want a quick prototype and your dataset is well-behaved (balanced, sufficient data). The main risk is training instability and mode collapse.
- WGAN-GP: Estimates the Wasserstein distance with a critic whose 1-Lipschitz constraint is enforced softly by a gradient penalty. The critic loss tends to correlate with sample quality (unlike BCE loss), making it a useful training diagnostic. Choose WGAN-GP when training stability is paramount or when the original GAN loss diverges. The cost is 3-5x slower training due to multiple critic updates per generator step, plus the gradient penalty computation, which roughly doubles backward-pass cost.
- Spectral Normalization (SN-GAN): Constrains the Lipschitz constant of the Discriminator by normalizing each weight matrix by its spectral norm (largest singular value). Unlike WGAN-GP, it requires only one D update per G update and adds negligible computational overhead (one power iteration step per forward pass). Choose SN when you want WGAN-level stability without the training cost. In practice, SN has become a default choice in many architectures, including SAGAN and BigGAN.
- Decision framework: for production image generation, start with SN-GAN. If quality plateaus, try WGAN-GP. Only fall back to vanilla BCE loss for simple datasets (MNIST, CIFAR) where you know the hyperparameters work. For large-scale generation (ImageNet, faces), BigGAN relies on SN, while StyleGAN2 uses the non-saturating loss with an R1 gradient penalty instead, which is another way of keeping the Discriminator’s gradients under control.
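The power iteration behind spectral normalization can be sketched in plain Python. This is a simplified illustration: it runs many iterations from a fixed start vector on a tiny matrix, whereas practical implementations keep the vector between training steps and run a single iteration per forward pass.

```python
import math

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def spectral_norm(w, n_iters=50):
    """Estimate the largest singular value of matrix `w` by power iteration."""
    wt = transpose(w)
    v = normalize([1.0] * len(w[0]))   # fixed start vector for reproducibility
    for _ in range(n_iters):
        u = normalize(matvec(w, v))    # u <- W v / ||W v||
        v = normalize(matvec(wt, u))   # v <- W^T u / ||W^T u||
    wv = matvec(w, v)
    return math.sqrt(sum(x * x for x in wv))  # sigma is approximately ||W v||

w = [[3.0, 0.0], [0.0, 1.0]]
sigma = spectral_norm(w)
w_sn = [[x / sigma for x in row] for row in w]  # rescaled weights
print(sigma, spectral_norm(w_sn))
```

Dividing the weights by `sigma` caps the layer's largest singular value at about 1, which is what bounds the Lipschitz constant of the layer.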
How do you evaluate the quality and diversity of a GAN's outputs? Discuss FID, IS, and their limitations.
- Inception Score (IS): measures two things: (1) quality, since high-quality images should have confident, peaked class predictions $p(y \mid x)$, and (2) diversity, since the marginal distribution $p(y)$ should be uniform across all classes. $\text{IS} = \exp(\mathbb{E}_x[D_{KL}(p(y \mid x) \,\|\, p(y))])$. Higher is better.
- IS limitations: it uses InceptionV3 trained on ImageNet, so it’s biased toward ImageNet-like images. A GAN generating perfect medical images would get a low IS. It also can’t detect intra-class mode collapse: generating a single image per class still yields a high IS, because every prediction is confident and the class marginal stays uniform. And critically, IS doesn’t compare against real data at all, so a GAN could generate images from a completely different distribution and still get high IS.
- Fréchet Inception Distance (FID): computes the Fréchet distance between two multivariate Gaussians fitted to InceptionV3 features of real and generated images: $\text{FID} = \|\mu_r - \mu_g\|^2 + \text{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})$. Lower is better. FID captures both quality and diversity and compares against the actual data distribution.
- FID limitations: assumes Gaussian feature distributions (which is a rough approximation), sensitive to sample size (need at least 10,000 samples for reliable estimates, ideally 50,000), and still depends on InceptionV3 features. For domains far from natural images, consider domain-specific metrics or Kernel Inception Distance (KID), which has an unbiased estimator and is less sensitive to sample size.
- A senior engineer would add: never rely on a single metric. Use FID as the primary quantitative measure, but always supplement with visual inspection (grid plots), precision/recall curves (to separate quality from diversity failures), and domain-specific metrics where applicable.
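The Fréchet distance at the core of FID can be sketched under a strong simplification. The version below assumes diagonal covariances, so the matrix square root reduces to an element-wise one; real FID uses full covariance matrices of InceptionV3 features. The tiny 2-D feature vectors are made up for illustration.

```python
import math

def mean_var(features):
    """Per-dimension mean and variance of a list of feature vectors."""
    n, d = len(features), len(features[0])
    mu = [sum(f[j] for f in features) / n for j in range(d)]
    var = [sum((f[j] - mu[j]) ** 2 for f in features) / n for j in range(d)]
    return mu, var

def frechet_distance_diag(real_feats, fake_feats):
    """Frechet distance between Gaussians with DIAGONAL covariances.

    Simplifying assumption: with diagonal covariances the trace term becomes
    sum(var_r + var_g - 2 * sqrt(var_r * var_g)).
    """
    mu_r, var_r = mean_var(real_feats)
    mu_g, var_g = mean_var(fake_feats)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    trace_term = sum(a + b - 2.0 * math.sqrt(a * b) for a, b in zip(var_r, var_g))
    return mean_term + trace_term

real = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
same = [row[:] for row in real]
shifted = [[x + 1.0 for x in row] for row in real]

print(frechet_distance_diag(real, same))     # approximately 0
print(frechet_distance_diag(real, shifted))  # approximately 2: unit mean shift in 2 dims
```

The mean term picks up a shift in where the features sit, and the trace term picks up a change in their spread, which is how one number can reflect both quality and diversity.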
Design a GAN system for generating realistic product images for an e-commerce platform. Walk through your architecture choices, training strategy, and quality assurance pipeline.
- Architecture choice: StyleGAN2 or StyleGAN3 as the backbone. These architectures produce some of the highest-quality images for structured objects and offer fine-grained control via the style-based generator. The mapping network transforms the latent code $z$ into an intermediate latent space $\mathcal{W}$, which controls different aspects of the image at different resolutions (coarse features like shape at low resolutions, fine details like texture at high resolutions).
- Data pipeline: collect at least 50,000 product images per category (shoes, bags, electronics). Clean the dataset rigorously — remove duplicates, watermarks, and low-quality images. Apply standardized backgrounds (white or transparent). Resize to a consistent resolution (512x512 or 1024x1024). Augment with horizontal flips only (not rotations — product orientation matters).
- Training strategy: progressive growing is unnecessary for StyleGAN2+ (skip and residual connections replace it). Train with R1 gradient penalty (weight $\gamma$), non-saturating logistic loss, and path length regularization every 16 minibatches. Use 4-8 GPUs with a total batch size of 32-64. Train for 25M+ images seen (not epochs) and track FID against a held-out validation set. Stop early when FID plateaus.
- Quality assurance pipeline: (1) automated FID/KID checks against the real dataset, (2) LPIPS-based diversity check (reject batches with mean pairwise LPIPS below a threshold), (3) human evaluation panel rating realism on a 1-5 scale for a random sample of 200 images, (4) A/B testing on the platform — do generated images lead to similar click-through and conversion rates as real product photos?
- Production considerations: serve the generator with ONNX Runtime or TensorRT for 10-50ms inference latency. Cache generated images rather than generating on-the-fly. Implement a moderation pipeline to catch any artifacts or inappropriate content before serving. Version the model and track FID over time to detect quality regression.