Adversarial Machine Learning
The Vulnerability of Neural Networks
Neural networks are surprisingly vulnerable to adversarial examples: inputs crafted to cause misclassification while appearing normal to humans. In the well-known example from Goodfellow et al. (2014), an image that a model classifies as a panda with 57.7% confidence is classified as a gibbon with 99.3% confidence after adding a perturbation so small that a human cannot see the difference. The changes are on the order of 1/255 per pixel: invisible to the naked eye, devastating to the model.
Why does this matter in production? Adversarial attacks aren’t just an academic curiosity. Self-driving cars need to correctly classify stop signs even when someone sticks a small adversarial patch on them. Medical imaging systems must resist subtle pixel perturbations that could change a “benign” diagnosis to “malignant.” Content moderation systems must resist adversarial bypasses. Any model deployed in a safety-critical or adversarial environment needs to be tested against these attacks.
Adversarial Attacks
Fast Gradient Sign Method (FGSM)
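A minimal sketch of the one-step FGSM update covered in this section, written against a toy logistic-regression model so the input gradient can be derived by hand. The weights, the input, and the helper names (`loss_and_input_grad`, `fgsm`) are assumptions made for illustration; with a deep network the gradient would come from a framework's autograd instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_input_grad(w, b, x, y):
    """Binary cross-entropy of a logistic-regression model and its gradient w.r.t. x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    grad_x = [(p - y) * wi for wi in w]  # closed-form dL/dx for this model
    return loss, grad_x

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """One step: x_adv = clip(x + eps * sign(grad_x L), 0, 1)."""
    _, grad_x = loss_and_input_grad(w, b, x, y)
    return [min(1.0, max(0.0, xi + eps * sign(g))) for xi, g in zip(x, grad_x)]

# Hypothetical toy model and input (pixel values in [0, 1]).
w, b = [2.0, -3.0, 1.0], 0.1
x, y = [0.5, 0.2, 0.7], 1
eps = 8 / 255

x_adv = fgsm(w, b, x, y, eps)
clean_loss, _ = loss_and_input_grad(w, b, x, y)
adv_loss, _ = loss_and_input_grad(w, b, x_adv, y)
```

In this toy setup `adv_loss` is larger than `clean_loss`, while no pixel moves by more than `eps`.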
The foundational one-step attack by Goodfellow et al. (2014). FGSM is to adversarial ML what “Hello World” is to programming: the simplest possible attack, yet surprisingly effective:
$$x_{adv} = x + \epsilon \cdot \text{sign}(\nabla_x L(\theta, x, y))$$
Intuition: Compute the gradient of the loss with respect to each input pixel; this tells you which direction to nudge each pixel to increase the classification error fastest. Then take just the sign of each gradient (+1 or -1) and scale by $\epsilon$. The sign operation means every pixel changes by exactly $\epsilon$, which is the perturbation that maximizes the linearized loss within an L-infinity budget of $\epsilon$. It’s a single backward pass, as cheap as one training step.
Projected Gradient Descent (PGD)
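A minimal sketch of L-infinity PGD, under toy assumptions: a hypothetical logistic-regression model with a hand-derived input gradient, a random start inside the epsilon ball, and a step size of `2.5 * eps / steps` (a common heuristic, not a requirement). The helper names are illustrative only.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def input_grad(w, b, x, y):
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]  # closed-form dL/dx for this model

def sign(v):
    return (v > 0) - (v < 0)

def pgd(w, b, x, y, eps, steps, rng):
    alpha = 2.5 * eps / steps  # heuristic step size (assumption)
    # Random start inside the L-infinity ball around x.
    x_adv = [min(1.0, max(0.0, xi + rng.uniform(-eps, eps))) for xi in x]
    for _ in range(steps):
        grad = input_grad(w, b, x_adv, y)
        # Small signed-gradient ascent step on the loss.
        x_adv = [xa + alpha * sign(g) for xa, g in zip(x_adv, grad)]
        # Project back into the eps ball, then into the valid pixel range.
        x_adv = [min(xi + eps, max(xi - eps, xa)) for xa, xi in zip(x_adv, x)]
        x_adv = [min(1.0, max(0.0, xa)) for xa in x_adv]
    return x_adv

# Hypothetical toy model and input.
w, b = [2.0, -3.0, 1.0], 0.1
x, y = [0.5, 0.2, 0.7], 1
eps = 8 / 255
x_adv = pgd(w, b, x, y, eps, steps=20, rng=random.Random(0))
```

For a linear model like this one, PGD ends at the same corner of the ball that a single signed-gradient step would reach; the benefit of iterating only shows up on non-linear models, where the gradient changes along the path.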
Widely regarded as the strongest first-order attack: iterative FGSM with projection. If FGSM takes one big step, PGD takes many small steps, recalculating the gradient at each point and projecting back into the allowed perturbation set after each one. This is much more effective because the loss landscape around a data point is highly non-linear, so a single large gradient step often misses the worst-case perturbation.
C&W Attack
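The tanh-space change of variables that C&W uses to stay inside the [0,1] pixel range can be sketched on its own. This is only the reparameterization, not the full attack; the function names and the `margin` parameter are assumptions for illustration.

```python
import math

def to_image(w):
    """Map unconstrained variables w to pixels in [0, 1]: x = 0.5 * (tanh(w) + 1)."""
    return [0.5 * (math.tanh(wi) + 1.0) for wi in w]

def to_tanh_space(x, margin=1e-6):
    """Inverse map. Pixels are pulled slightly away from exactly 0 and 1,
    where atanh would diverge."""
    return [math.atanh((2.0 * xi - 1.0) * (1.0 - margin)) for xi in x]

x = [0.0, 0.25, 0.9, 1.0]
w = to_tanh_space(x)
x_roundtrip = to_image(w)

# Any update applied to w, however large, still decodes to valid pixels,
# so the optimizer never needs an explicit clipping or projection step.
w_updated = [wi + step for wi, step in zip(w, [50.0, -50.0, 3.0, -0.5])]
x_new = to_image(w_updated)
```

The design choice here is to move the box constraint into the parameterization, which lets a standard unconstrained optimizer be applied directly to `w`.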
Carlini and Wagner (2017): a fundamentally different approach. Instead of constraining the perturbation and maximizing the loss (like PGD), C&W jointly optimizes for minimal perturbation AND misclassification using a Lagrangian formulation. This often finds smaller perturbations than PGD, though at much higher computational cost (1000+ optimization steps vs 20-40 for PGD). C&W also uses a tanh-space reparameterization to handle the [0,1] box constraint elegantly: the attack optimizes an unconstrained variable $w$ and sets $x_{adv} = \frac{1}{2}(\tanh(w) + 1)$, which always lies in [0,1].
Adversarial Defenses
Adversarial Training
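A minimal sketch of the adversarial training loop (attack the current model, then take a gradient step on the attacked batch), using a toy 2-D logistic-regression model. The dataset, learning rate, epoch count, and helper names are all assumptions chosen to keep the example small; PGD is run without a random start so the result is deterministic.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(v):
    return (v > 0) - (v < 0)

def pgd(w, b, x, y, eps, steps=5):
    """L-infinity PGD against the logistic-regression model (no random start)."""
    alpha = 2.5 * eps / steps
    x_adv = list(x)
    for _ in range(steps):
        p = predict(w, b, x_adv)
        grad = [(p - y) * wi for wi in w]  # dL/dx for cross-entropy loss
        x_adv = [xa + alpha * sign(g) for xa, g in zip(x_adv, grad)]
        x_adv = [min(xi + eps, max(xi - eps, xa)) for xa, xi in zip(x_adv, x)]
        x_adv = [min(1.0, max(0.0, xa)) for xa in x_adv]
    return x_adv

def adversarial_train(data, eps, lr=1.0, epochs=300):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        # Inner maximization: attack the current model on every training point.
        batch = [(pgd(w, b, x, y, eps), y) for x, y in data]
        # Outer minimization: one gradient step on the adversarial batch.
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x_adv, y in batch:
            err = predict(w, b, x_adv) - y
            grad_w = [gw + err * xi / len(batch) for gw, xi in zip(grad_w, x_adv)]
            grad_b += err / len(batch)
        w = [wi - lr * gw for wi, gw in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def robust_accuracy(w, b, data, eps):
    hits = sum((predict(w, b, pgd(w, b, x, y, eps)) > 0.5) == (y == 1)
               for x, y in data)
    return hits / len(data)

# Hypothetical, well-separated toy dataset with inputs in [0, 1]^2.
data = [([0.8, 0.8], 1), ([0.9, 0.7], 1), ([0.7, 0.9], 1),
        ([0.2, 0.2], 0), ([0.1, 0.3], 0), ([0.3, 0.1], 0)]
eps = 0.1
w, b = adversarial_train(data, eps)
acc = robust_accuracy(w, b, data, eps)
```

The extra cost is visible in the structure: every epoch runs the attack (several gradient evaluations per sample) before the single weight update.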
The most effective empirical defense, and conceptually the simplest: train on adversarial examples. At each training step, generate adversarial perturbations of the current batch using PGD, then update the model weights to correctly classify those adversarial examples. The model learns to be robust by constantly facing worst-case inputs.
The cost: adversarial training is 5-10x slower than standard training because each training step requires running PGD (multiple forward+backward passes for attack generation) before the actual weight update. For CIFAR-10, this means training takes 1-2 days on a single GPU instead of a few hours.
Input Preprocessing Defenses
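Input preprocessing defenses transform the input before it reaches the model, hoping to wash out the perturbation. A minimal sketch using bit-depth reduction, chosen here purely as an illustration (JPEG compression follows the same pattern); the function name and sample values are assumptions.

```python
def reduce_bit_depth(pixels, bits=3):
    """Quantize pixels in [0, 1] to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return [round(p * levels) / levels for p in pixels]

# Hypothetical clean input and a small perturbation of it.
clean = [0.12, 0.45, 0.88]
eps = 2 / 255
perturbed = [0.12 + eps, 0.45 - eps, 0.88 + eps]

# Both inputs snap to the same quantization levels, so the model would
# see identical inputs and the perturbation has no effect here.
squeezed_clean = reduce_bit_depth(clean)
squeezed_perturbed = reduce_bit_depth(perturbed)
```

This only helps against perturbations that do not push a pixel across a quantization boundary, and an attacker who knows about the preprocessing can target those boundaries directly, so defenses of this kind should always be evaluated against adaptive attacks.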
Certified Defenses
Randomized Smoothing
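A minimal sketch of the certification procedure, assuming a hypothetical two-class base classifier. Following the structure of Cohen et al. (2019), a small sample picks the candidate class and a separate, larger sample estimates its probability; the certified L2 radius is `sigma * Phi^{-1}(p_lower)`. One substitution to note: the lower confidence bound below uses a simple Hoeffding bound so the example needs only the standard library, whereas the original procedure uses a tighter Clopper-Pearson bound.

```python
import math
import random
from collections import Counter
from statistics import NormalDist

def base_classifier(x):
    """Hypothetical base classifier: class 1 if the coordinates sum to more than 1."""
    return 1 if sum(x) > 1.0 else 0

def noisy_votes(x, sigma, n, rng):
    """Class counts of the base classifier under n Gaussian perturbations of x."""
    return Counter(
        base_classifier([xi + rng.gauss(0.0, sigma) for xi in x]) for _ in range(n)
    )

def certify(x, sigma, n0, n, alpha, rng):
    """Return (class, certified L2 radius), or (None, 0.0) to abstain."""
    guess = noisy_votes(x, sigma, n0, rng).most_common(1)[0][0]
    count = noisy_votes(x, sigma, n, rng)[guess]
    # One-sided Hoeffding lower bound on the probability of `guess`;
    # it holds with probability at least 1 - alpha (assumption: swapped in
    # for the Clopper-Pearson bound).
    p_lower = count / n - math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    if p_lower <= 0.5:
        return None, 0.0
    return guess, sigma * NormalDist().inv_cdf(p_lower)

pred, radius = certify([0.9, 0.9], sigma=0.25, n0=100, n=1000, alpha=0.001,
                       rng=random.Random(0))
```

The returned radius is conservative: it is smaller than the true distance from the input to the decision boundary, and more noise samples tighten it at the cost of more forward passes.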
One of the few approaches to certified (provable) robustness that scales to ImageNet-size models. The core idea: define a smoothed classifier that returns the class the base classifier predicts most often under random Gaussian perturbations of the input. If one class wins by a large enough margin, the smoothed classifier’s prediction provably cannot change under any adversarial perturbation within a certifiable L2 radius $R$. Unlike adversarial training (which is empirical, with no guarantees), randomized smoothing gives a mathematical certificate: “no adversarial perturbation within radius $R$ can change this prediction.” In practice the certificate is estimated by sampling, so it holds with high probability rather than with certainty.
Robust Architecture Design
Robustness Evaluation
Exercises
Exercise 1: Implement Square Attack
Exercise 2: TRADES vs PGD Training
- Train models with both methods
- Compare clean vs robust accuracy tradeoff
- Evaluate with AutoAttack
Exercise 3: Certified Smoothing
- Train a smoothed classifier
- Compute certified radii
- Plot certified accuracy vs radius
Training Tips
Interview Deep-Dive
Explain FGSM and PGD attacks. Why is PGD considered the 'gold standard' for robustness evaluation, and what are its limitations?
- FGSM (Fast Gradient Sign Method): a single-step attack. Compute the gradient of the loss with respect to the input, take its sign, and scale by epsilon. The adversarial image is $x_{adv} = x + \epsilon \cdot \text{sign}(\nabla_x L(\theta, x, y))$. It’s fast (one forward + one backward pass) but weak because the loss landscape is highly non-linear and a single gradient step often doesn’t find the worst-case perturbation.
- PGD (Projected Gradient Descent): an iterative version of FGSM. Take many small FGSM steps (step size alpha, typically alpha = 2.5 * epsilon / num_iter), and after each step project the perturbation back into the L-infinity epsilon ball. Start from a random point within the ball (random start), and repeat the attack from several such points (random restarts) to avoid poor local optima. With enough iterations (20-40) and random restarts (10+), PGD is empirically close to the strongest attack that uses only first-order (gradient) information; this is an empirical finding, not a proven guarantee.
- Why PGD is the gold standard: Madry et al. (2018) framed robustness as a min-max problem, showed that the inner maximization is approximately solved by iterative projected gradient ascent, and found that adversarial training against PGD produces models that are empirically robust to a wide range of first-order attacks. The key empirical observation is that the adversarial loss landscape, while non-convex, has local maxima of very similar loss value in practice, so PGD reliably finds near-optimal adversarial examples.
- Limitations of PGD: (1) it maximizes a fixed loss within a fixed perturbation budget using gradients only, so minimum-norm, optimization-based attacks (C&W) can sometimes find smaller perturbations; (2) PGD is slow for large epsilon budgets or high-resolution images; (3) PGD evaluates L-infinity robustness by default, but real-world attacks may use other threat models (L2, spatial transformations, color shifts). For this reason, AutoAttack (a standardized ensemble of four diverse attacks) has become the recommended evaluation protocol.
- A senior engineer would note: the number of PGD restarts matters enormously. Evaluating with PGD-20 (20 steps, no restarts) can overestimate robustness by 5-10% compared to PGD-20 with 10 random restarts. Always report the attack configuration precisely.
What is the accuracy-robustness trade-off? Is it fundamental or just an artifact of current methods? Explain with concrete numbers.
- The empirical observation: adversarially trained models consistently achieve lower clean accuracy than standard models. On CIFAR-10: standard training gets approximately 95% clean accuracy, while PGD adversarial training (epsilon=8/255) gets approximately 85% clean accuracy and approximately 50-55% PGD-robust accuracy. On ImageNet: standard models hit approximately 80% top-1; adversarially robust models hit approximately 65% clean and approximately 35% robust.
- Is it fundamental? There is growing theoretical and empirical evidence that the trade-off is inherent to the problem, not just a limitation of current algorithms. Tsipras et al. (2019) showed that in certain data distributions, robust classifiers must use fundamentally different features than accurate classifiers — robust features are more semantically meaningful but less predictive. Zhang et al. (TRADES, 2019) proved a decomposition: robustness error is bounded by the sum of clean error and a boundary complexity term, suggesting you can’t minimize both simultaneously.
- The concrete reason: standard models exploit “non-robust features” — statistical patterns in the data that are predictive of the class label but are fragile under small perturbations. These features actually contain real signal (not just noise), which is why standard models that use them achieve higher accuracy. Adversarial training forces the model to ignore these features and rely only on “robust features” (patterns that survive perturbation), which are fewer and less discriminative.
- TRADES addresses the trade-off explicitly: its loss function lets you tune beta to control the clean-robust balance. Higher beta means more robustness at the cost of clean accuracy. The optimal beta depends on the deployment context — a self-driving car system should favor robustness; a photo tagging system might favor clean accuracy.
- A senior engineer would add: in production, the trade-off means you need separate models for adversarial and non-adversarial settings, or an ensemble that routes inputs based on threat detection. Don’t deploy a single adversarially trained model for all use cases — you’re paying the clean accuracy cost even when there’s no adversary.
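The TRADES objective mentioned above can be sketched as cross-entropy on the clean input plus beta times the KL divergence between the clean and adversarial predictions. This is a minimal per-example sketch with hypothetical logits; in the actual method the adversarial input is found by maximizing the KL term, which is assumed to have happened already here.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def trades_loss(clean_logits, adv_logits, label, beta):
    """Cross-entropy on the clean input plus beta * KL(p_clean || p_adv)."""
    p_clean = softmax(clean_logits)
    p_adv = softmax(adv_logits)
    clean_term = -math.log(p_clean[label])
    robust_term = sum(pc * math.log(pc / pa) for pc, pa in zip(p_clean, p_adv))
    return clean_term + beta * robust_term

# Hypothetical logits for one example whose true class is 0.
clean_logits = [2.0, 0.5, -1.0]
adv_logits = [0.5, 1.5, -1.0]

low_beta = trades_loss(clean_logits, adv_logits, label=0, beta=1.0)
high_beta = trades_loss(clean_logits, adv_logits, label=0, beta=6.0)
```

With a larger `beta`, the same disagreement between clean and adversarial predictions is penalized more heavily, which is how the clean-versus-robust balance is tuned.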
Compare empirical defenses (adversarial training) with certified defenses (randomized smoothing). When would you use each in production?
- Adversarial training (empirical): train on PGD-generated adversarial examples. Achieves the best empirical robustness — approximately 60% robust accuracy on CIFAR-10 at epsilon=8/255 for state-of-the-art models (WideResNet-70-16 with extra data). No formal guarantees: a sufficiently clever attacker might find perturbations that break the model. Training cost is 5-10x standard training.
- Randomized smoothing (certified): wrap any base classifier with Gaussian noise averaging. Provides a formal certificate: “for this specific input, no L2 perturbation within radius $R$ can change the prediction.” Certified accuracy on CIFAR-10 at L2 epsilon=0.5: approximately 60%. The downside: inference requires 100-10,000 forward passes per input (one per noise sample), making it 100-10,000x slower. And the certificates are per-input: some inputs get large radii, others get small radii or abstain.
- When to use adversarial training: when you need low-latency inference and are defending against known threat models (e.g., L-infinity perturbations). The lack of formal guarantees is acceptable when the attack surface is well-characterized and you evaluate with AutoAttack. Best for: image classification, content moderation, any system where you can tolerate empirical robustness.
- When to use randomized smoothing: when you need a formal guarantee that a specific prediction cannot be changed by any perturbation within the certified radius, regardless of what the attacker does. Note that the certificate guarantees stability, not correctness: a wrong prediction can be certified too. The guarantee can still be legally and contractually meaningful: you can certify that “this medical image classification cannot be changed by any perturbation within this radius.” Best for: safety-critical systems (medical imaging, autonomous vehicles), regulatory compliance, or as a certification layer on top of adversarial training.
- Hybrid approach: adversarially train the base classifier, then wrap it with randomized smoothing. This gives you the best of both worlds: strong empirical robustness from adversarial training, plus formal certificates from smoothing. The certified radii are larger than smoothing alone because the base classifier is already robust to moderate perturbations.
A colleague proposes defending your image classifier by adding JPEG compression as a preprocessing step before inference. Evaluate this defense.
- The appeal: JPEG compression removes high-frequency components from images, and adversarial perturbations often contain high-frequency patterns. In initial testing against standard FGSM or PGD, JPEG preprocessing may appear to reduce attack success rate by 20-40%. It’s also trivial to implement — literally one line of code.
- Why it fails: this defense has been thoroughly broken by adaptive attacks. The key insight: if the attacker knows JPEG compression is being applied (which we must assume per Kerckhoffs’ principle), they can incorporate the JPEG operation into their attack loop. JPEG’s quantization step is not differentiable, but it can be replaced with a differentiable approximation on the backward pass, so PGD through the JPEG layer finds adversarial examples that survive compression. JPEG compression was studied as a defense by Dziugaite et al. (2016); Athalye et al. (2018, “Obfuscated Gradients”) later showed that input-transformation defenses of this kind retain almost no effectiveness against adaptive attacks.
- The general principle, “obfuscated gradients”: JPEG compression is one example of a broader failure mode: defenses that appear to work because they break the gradient computation that attacks rely on, without actually making the model robust. Three types: (1) shattered gradients (non-differentiable operations like JPEG), (2) stochastic gradients (random transformations), (3) vanishing/exploding gradients (very deep preprocessing). All three have been systematically broken: shattered gradients with Backward Pass Differentiable Approximation (BPDA), stochastic gradients with Expectation over Transformation (EOT), and vanishing/exploding gradients with reparameterization.
- What to recommend instead: adversarial training is the empirical defense with the most sustained success. If full adversarial training is too costly, consider ensemble adversarial training, or randomized smoothing if certified guarantees justify its much slower inference. But always evaluate against adaptive attacks before claiming robustness.
- A senior engineer would add: the history of adversarial defenses is littered with papers that claimed robustness, got accepted to top venues, and were broken within months by adaptive attacks. The lesson: never evaluate a defense only against standard (non-adaptive) attacks. Always assume the attacker knows your defense and has white-box access to the model. If your defense only works against oblivious attackers, it’s not a defense — it’s security through obscurity.
Design a robustness evaluation pipeline for a deployed image classification model. What attacks would you run, in what order, and what thresholds would you set?
- Step 1: Establish clean accuracy baseline. Evaluate on the standard test set without any attack. This is your upper bound. Record top-1 and top-5 accuracy plus per-class accuracy (robustness often varies significantly across classes).
- Step 2: FGSM evaluation (weak attack, fast sanity check). Run FGSM at the standard epsilon for your domain (8/255 for CIFAR-10-like, 4/255 for ImageNet-like). If the model is not robust to FGSM, there’s no point running stronger attacks — go directly to adversarial training. FGSM takes seconds to run on the full test set.
- Step 3: PGD-20 with 5 random restarts (strong first-order attack). This is the standard benchmark attack. Report accuracy at the standard epsilon. Expected robust accuracy for a well-trained adversarially robust model: 50-60% on CIFAR-10, 30-40% on ImageNet. This step takes 10-30 minutes on a GPU.
- Step 4: AutoAttack (standardized evaluation). Run the full AutoAttack suite: APGD-CE, APGD-DLR, FAB attack, and Square Attack (black-box). AutoAttack is the community standard for robustness claims — results are comparable across papers. This takes 1-4 hours on a GPU.
- Step 5: Robustness curve. Plot accuracy vs epsilon for epsilon in [0, 0.01, 0.02, …, 0.1]. This shows the full picture: at what perturbation level does the model break? The curve should degrade gracefully, not cliff-dive at a specific epsilon.
- Thresholds (CIFAR-10, epsilon=8/255): AutoAttack robust accuracy above 50% is competitive, above 55% is strong, above 60% is state-of-the-art (as of 2025, per RobustBench leaderboard). Clean accuracy should be above 80% (below 80% suggests the robustness came at too high a clean accuracy cost, or training went wrong).
- Step 6: Per-class robustness analysis. Some classes are inherently harder to defend (e.g., “cat” vs “dog” are more confusable than “airplane” vs “frog”). Report per-class robust accuracy and flag classes where robustness drops below a minimum threshold.
- Production integration: run this pipeline as a CI/CD step on every model checkpoint. Track robustness metrics over time on a dashboard alongside clean accuracy. Set alerts for robustness regression (e.g., AutoAttack accuracy drops more than 2% between model versions).
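The regression alert described in the last step can be sketched as a small check that a CI job runs after evaluation. The metric names, the sample values, and the function name are hypothetical; only the 2-point threshold comes from the text above.

```python
def robustness_regressions(previous, current, max_drop=2.0):
    """Return metrics whose value fell by more than max_drop percentage points."""
    return {
        name: (previous[name], current[name])
        for name in previous
        if name in current and previous[name] - current[name] > max_drop
    }

# Hypothetical accuracies (in percent) for two consecutive model versions.
previous = {"clean": 85.1, "pgd20": 54.0, "autoattack": 51.2}
current = {"clean": 85.4, "pgd20": 53.1, "autoattack": 48.7}

flagged = robustness_regressions(previous, current)
# A CI step could fail the build whenever `flagged` is non-empty.
```

Here only the AutoAttack metric is flagged, since it is the only one that dropped by more than 2 points between versions.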