Neural Architecture Search: Automating Design
Why Automate Architecture Design?
The architectures you have learned about so far, such as ResNet and Inception, were designed by humans through months or years of trial and error. NAS asks whether that process can be automated, letting machines design networks that are better than the ones humans produce.
Designing neural networks manually is:
- Time-consuming: Weeks or months of human experimentation for a single architecture
- Expertise-dependent: Requires deep intuition that takes years to develop
- Suboptimal: Humans can only explore a tiny fraction of the possible design space
- Biased: We tend to design architectures similar to ones we have seen before
NAS can:
- Explore vast architecture spaces systematically, considering millions of candidates
- Find novel, non-intuitive designs: NAS-discovered cells often look nothing like human designs
- Optimize for specific hardware constraints, such as latency on a particular phone chip or a memory budget
- Discover architectures that outperform human designs: MnasNet and the EfficientNet-B0 baseline were found by NAS
NAS Components
1. Search Space
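The search space defines which architectures are possible: the macro structure (layer count, topology), the layout of each repeated cell, and the candidate operations.
2. Search Strategy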
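The search strategy determines how the space is explored, for example by random sampling, evolution, reinforcement learning, or gradient-based relaxation.
3. Performance Estimation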
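Performance estimation determines how candidates are scored without fully training each one, for example through proxy tasks, weight sharing, or a learned predictor.
DARTS: Differentiable Architecture Search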
DARTS was a breakthrough because it reduced the cost of architecture search to roughly that of training a single model. The trick: instead of discretely choosing one operation per edge (which requires evaluating exponentially many combinations), relax the choice to a continuous weighted sum of ALL operations. Then you can optimize both the architecture weights and the network weights using standard gradient descent.
Think of it as a restaurant where you order every item on the menu simultaneously, then gradually increase the weight on the dishes you like and decrease the weight on the ones you do not. At the end, you just keep the highest-weighted dish for each course.
The DARTS Cell
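A minimal sketch of the continuous relaxation on a single edge, in plain Python so it runs without a deep learning framework. The three toy operations, the `mixed_op` and `discretize` helpers, and the finite-difference gradient are assumptions made for illustration. DARTS itself mixes neural network layers, uses automatic differentiation, and alternates architecture-weight updates on validation data with network-weight updates on training data; none of that bilevel structure is modeled here.

```python
import math

# Toy candidate operations on a single float (hypothetical stand-ins for
# convolutions, pooling, and skip connections).
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}


def softmax(values):
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(x, alphas):
    """Continuous relaxation: softmax-weighted sum of every candidate's output."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))


def loss(alphas, data):
    return sum((mixed_op(x, alphas) - y) ** 2 for x, y in data) / len(data)


def numerical_grad(alphas, data, eps=1e-5):
    """Central finite differences, used here in place of autograd."""
    grads = []
    for i in range(len(alphas)):
        up, down = alphas[:], alphas[:]
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, data) - loss(down, data)) / (2 * eps))
    return grads


def discretize(alphas):
    """After the search, keep only the operation with the largest weight."""
    names = list(CANDIDATE_OPS)
    return names[max(range(len(alphas)), key=lambda i: alphas[i])]


# The target function is y = 2x, so the search should favor "double".
data = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]
alphas = [0.0, 0.0, 0.0]  # one architecture weight per candidate operation
for _ in range(200):
    grads = numerical_grad(alphas, data)
    alphas = [a - 0.5 * g for a, g in zip(alphas, grads)]

print(discretize(alphas))  # double
```

The mixed output is a convex combination of the candidates, so gradient descent can shift weight smoothly toward the operation that fits the data before the final argmax makes the choice discrete.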
Evolutionary Search
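The loop this section describes can be sketched in a few lines. Everything named here is an assumption for illustration: the operation list, the six-layer chain encoding, the tournament selection, and in particular `toy_fitness`, which stands in for the expensive step of training a candidate and measuring its accuracy.

```python
import random

OPS = ["conv3x3", "conv5x5", "max_pool", "skip"]  # hypothetical operation names
NUM_LAYERS = 6


def toy_fitness(arch):
    """Stand-in for validation accuracy; a real run would train the model."""
    score = {"conv3x3": 1.0, "conv5x5": 0.8, "max_pool": 0.3, "skip": 0.1}
    return sum(score[op] for op in arch) / len(arch)


def mutate(arch, rng):
    """Copy the parent and swap the operation at one random position."""
    child = list(arch)
    pos = rng.randrange(len(child))
    child[pos] = rng.choice([op for op in OPS if op != child[pos]])
    return tuple(child)


def evolve(generations=200, population_size=20, tournament_size=5, seed=0):
    rng = random.Random(seed)
    population = [
        tuple(rng.choice(OPS) for _ in range(NUM_LAYERS))
        for _ in range(population_size)
    ]
    for _ in range(generations):
        # Selection: the fittest of a random subset becomes the parent.
        contenders = rng.sample(population, tournament_size)
        parent = max(contenders, key=toy_fitness)
        # Reproduction: add one mutated child, then remove the weakest member.
        population.append(mutate(parent, rng))
        population.remove(min(population, key=toy_fitness))
    return max(population, key=toy_fitness)


best = evolve()
print(best, round(toy_fitness(best), 3))
```

In a real system each call to the fitness function is an independent training job, which is why the approach parallelizes so easily across GPUs.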
Evolutionary NAS takes inspiration from biological evolution: maintain a population of architectures, evaluate their “fitness” (accuracy), let the best ones reproduce (via mutation), and kill off the weakest. Google’s AmoebaNet, discovered through evolutionary search, achieved state-of-the-art ImageNet accuracy in 2018. The approach is embarrassingly parallel, since each candidate can be evaluated on a separate GPU, which makes it a natural fit for large compute clusters.
Weight Sharing / One-Shot NAS
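A toy sketch of weight sharing, assuming a two-layer "network" of scalar operations and uniform single-path sampling during training (one of several ways to train a supernet). The operation set, `init_supernet`, `train_supernet`, and the finite-difference gradient are all illustrative assumptions. The point is the structure: one shared weight table, and sub-networks that are evaluated by selecting entries from it.

```python
import itertools
import random

# Toy candidate operations: each takes a shared weight w and an input x.
CANDIDATE_OPS = {
    "scale": lambda w, x: w * x,
    "shift": lambda w, x: x + w,
}
NUM_LAYERS = 2


def init_supernet():
    """One weight per (layer, op) pair. Every sub-network reads this table."""
    return {(layer, name): 0.5
            for layer in range(NUM_LAYERS) for name in CANDIDATE_OPS}


def forward(supernet, arch, x):
    """Run the sub-network selected by `arch` (one op name per layer)."""
    for layer, name in enumerate(arch):
        x = CANDIDATE_OPS[name](supernet[(layer, name)], x)
    return x


def loss(supernet, arch, data):
    return sum((forward(supernet, arch, x) - y) ** 2 for x, y in data) / len(data)


def train_supernet(supernet, data, steps=3000, lr=0.01, eps=1e-5, seed=0):
    """Each step samples one path and updates only the weights on that path."""
    rng = random.Random(seed)
    names = list(CANDIDATE_OPS)
    for _ in range(steps):
        arch = tuple(rng.choice(names) for _ in range(NUM_LAYERS))
        grads = {}
        for key in enumerate(arch):  # key is a (layer, op_name) tuple
            original = supernet[key]
            supernet[key] = original + eps
            up = loss(supernet, arch, data)
            supernet[key] = original - eps
            down = loss(supernet, arch, data)
            supernet[key] = original
            grads[key] = (up - down) / (2 * eps)  # finite-difference gradient
        for key, grad in grads.items():
            supernet[key] -= lr * grad
    return supernet


data = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
supernet = train_supernet(init_supernet(), data)

# Score every sub-network with the shared weights; nothing is retrained here.
ranking = sorted(
    (loss(supernet, arch, data), arch)
    for arch in itertools.product(CANDIDATE_OPS, repeat=NUM_LAYERS)
)
for value, arch in ranking:
    print(f"{arch}: loss {value:.3f}")
```

Because each weight is pulled toward whatever suits all the paths that use it, the shared values are a compromise. That is the same interference that can distort supernet rankings in practice.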
The key bottleneck in NAS is evaluation: training a candidate architecture to convergence takes hours or days. Weight sharing solves this by training a single “supernet” that contains all possible architectures simultaneously. Different architectures are just different subsets of the supernet’s weights. After training, you can evaluate any architecture in seconds by simply selecting the relevant weights, with no retraining needed.
The analogy: instead of building and test-driving a thousand different cars, build one car with every possible engine, every possible transmission, and every possible suspension. Then “test” different configurations by just activating the relevant components.
Hardware-Aware NAS
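One way to fold latency into the search objective is to scale accuracy by a power of the latency ratio. The sketch below is hedged accordingly: the exponent of -0.07 matches the value reported for MnasNet's soft latency constraint, but it and the 50 ms target should be treated as tunable assumptions. The lookup table values and both function names are hypothetical.

```python
def latency_aware_reward(accuracy, latency_ms, target_ms=50.0, exponent=-0.07):
    """Scale accuracy by (latency / target) ** exponent.

    With a negative exponent, models slower than the target are penalized
    and faster ones receive a mild bonus.
    """
    return accuracy * (latency_ms / target_ms) ** exponent


def estimate_latency_ms(arch, latency_table):
    """Sum per-operation latencies from a table measured on the target device."""
    return sum(latency_table[op] for op in arch)


# Hypothetical per-operation latencies in milliseconds.
LATENCY_TABLE_MS = {"conv3x3": 1.2, "conv5x5": 2.9, "skip": 0.05}

fast = latency_aware_reward(accuracy=0.80, latency_ms=5.0)
slow = latency_aware_reward(accuracy=0.82, latency_ms=500.0)
print(f"fast: {fast:.3f}, slow: {slow:.3f}")  # fast: 0.940, slow: 0.698
```

Under these assumptions the 80%-accurate, 5 ms model scores higher than the 82%-accurate, 500 ms one, so the search is steered toward the faster design. A lookup table like `LATENCY_TABLE_MS` avoids running every candidate on the device during the search.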
In production, accuracy alone is not enough. A model that achieves 80% accuracy in 5ms on a phone is often more valuable than one that achieves 82% accuracy in 500ms. Hardware-aware NAS adds latency, FLOPs, memory, and energy consumption as explicit optimization objectives alongside accuracy. Google’s MnasNet was discovered this way, optimized for measured inference latency on a mobile phone; the EfficientNet-B0 baseline came from a similar multi-objective search that used FLOPs as the cost term.
Practical NAS: Once-for-All
Exercises
Exercise 1: Implement Random Search NAS
Implement a random search baseline:
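A possible starting point, not a reference solution. The search space dictionary and the `evaluate` placeholder are invented for illustration; the exercise is to replace `evaluate` with a short training run that returns validation accuracy.

```python
import random

# Hypothetical search space: each key is a design choice with its options.
SEARCH_SPACE = {
    "depth": [8, 14, 20],
    "width": [16, 32, 64],
    "kernel_size": [3, 5, 7],
}


def sample_architecture(rng):
    """Pick one option per design choice, uniformly at random."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def evaluate(arch):
    """Placeholder score; replace with training plus validation accuracy."""
    return (0.5 + 0.01 * arch["depth"] + 0.002 * arch["width"]
            - 0.01 * arch["kernel_size"])


def random_search(num_trials, seed=0):
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(num_trials):
        arch = sample_architecture(rng)
        score = evaluate(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score


best_arch, best_score = random_search(num_trials=50)
print(best_arch, round(best_score, 3))
```

Random search is worth keeping as a baseline: any more elaborate strategy should beat it under the same evaluation budget.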
Exercise 2: Add Predictor-Based NAS
Train a performance predictor to speed up search:
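A possible starting point under simplifying assumptions. The predictor here is a linear model on one-hot features fitted with plain stochastic gradient descent, swapped in for the MLP or GNN a real system might use. `true_score` is a hypothetical stand-in for an expensive training run, and the four-layer chain space is invented for the example.

```python
import itertools
import random

OPS = ["conv3x3", "conv5x5", "max_pool", "skip"]  # hypothetical operation names
NUM_LAYERS = 4


def encode(arch):
    """One-hot encode each layer's operation and append a bias term."""
    features = []
    for op in arch:
        features.extend(1.0 if op == candidate else 0.0 for candidate in OPS)
    return features + [1.0]


def true_score(arch):
    """Stand-in for an expensive training run (hypothetical additive score)."""
    value = {"conv3x3": 0.20, "conv5x5": 0.15, "max_pool": 0.05, "skip": 0.02}
    return sum(value[op] for op in arch)


def predict(weights, arch):
    return sum(w * xi for w, xi in zip(weights, encode(arch)))


def fit_linear_predictor(archs, scores, epochs=500, lr=0.05):
    """Least-squares fit by per-sample gradient steps."""
    weights = [0.0] * len(encode(archs[0]))
    for _ in range(epochs):
        for arch, score in zip(archs, scores):
            x = encode(arch)
            error = sum(w * xi for w, xi in zip(weights, x)) - score
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return weights


rng = random.Random(0)
all_archs = list(itertools.product(OPS, repeat=NUM_LAYERS))  # 256 candidates
seed_archs = rng.sample(all_archs, 40)  # the only ones that get "trained"
weights = fit_linear_predictor(seed_archs, [true_score(a) for a in seed_archs])

# Score the whole space with the cheap predictor and keep a shortlist.
ranked = sorted(all_archs, key=lambda a: predict(weights, a), reverse=True)
top_candidates = ranked[:5]
print(top_candidates[0])
```

The shortlist would then be trained for real, so the predictor only has to rank well near the top rather than predict accuracy exactly.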
Exercise 3: Multi-Objective NAS
Implement Pareto-optimal architecture search:
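A possible starting point for the selection step, assuming two objectives: accuracy (higher is better) and latency (lower is better). The candidate list is made-up data, and `dominates` and `pareto_front` are illustrative names.

```python
def dominates(a, b):
    """True if a is no worse than b on both objectives and better on one."""
    no_worse = a["accuracy"] >= b["accuracy"] and a["latency_ms"] <= b["latency_ms"]
    better = a["accuracy"] > b["accuracy"] or a["latency_ms"] < b["latency_ms"]
    return no_worse and better


def pareto_front(candidates):
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical evaluation results.
candidates = [
    {"name": "A", "accuracy": 0.80, "latency_ms": 5},
    {"name": "B", "accuracy": 0.82, "latency_ms": 500},
    {"name": "C", "accuracy": 0.78, "latency_ms": 20},
    {"name": "D", "accuracy": 0.81, "latency_ms": 60},
]

front = pareto_front(candidates)
print([c["name"] for c in front])  # ['A', 'B', 'D']
```

Candidate C is dropped because A is both more accurate and faster. The remaining three represent different trade-offs, and choosing among them is a deployment decision rather than a search problem.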
What’s Next?
Interpretability
Understand what your models learn
Adversarial Robustness
Defend against adversarial attacks
Interview Deep-Dive
Why is NAS computationally expensive, and how do methods like DARTS make it tractable?
Strong Answer:
The original NAS method (Zoph and Le, 2017) used reinforcement learning to train a controller that proposes architectures. Each candidate was trained from scratch and evaluated to compute the reward. This required thousands of full training runs: 800 GPUs for 28 days (about 22,400 GPU-days), roughly $250K in compute.
DARTS reformulates the discrete search as continuous optimization. Instead of choosing one operation per edge, DARTS places all candidates in parallel with learnable mixing weights, which are optimized by gradient descent alongside the network weights. After the search, the highest-weight operation on each edge is selected.
This reduces the cost from thousands of runs to approximately one run. DARTS finds competitive architectures in 1-4 GPU-days, a reduction of more than 1000x.
The trade-off: DARTS suffers from “collapse”, where the search converges to skip-connection-heavy architectures that are easy to optimize in the supernet but have low capacity. Skip connections are parameter-free and ease gradient flow, so their mixing weights tend to grow early and bias the search. Fixes include Fair DARTS and Progressive DARTS (P-DARTS).
Follow-up: When would you actually use NAS versus picking a known architecture?
Two scenarios justify the cost of NAS. First, deploying to specific hardware with unusual constraints (a particular mobile chip, an FPGA), where NAS can optimize jointly for accuracy and hardware-specific latency. Second, designing an architecture family for massive-scale deployment, where the upfront search cost is amortized across billions of inferences. For a one-off project or small-scale deployment, picking EfficientNet or ResNet off the shelf is almost always correct.
How would you define the search space for NAS? What are the risks of making it too large or too small?
Strong Answer:
The search space defines what NAS can discover, typically at three levels: macro (layer count, topology), cell-level (operations within each repeated cell), and operation-level (specific convolution, pooling, or connection types).
Too large: the algorithm wastes budget exploring terrible architectures and may never find the good region. The search becomes a needle-in-a-haystack problem, and degenerate architectures (all skip connections, extreme depths) consume evaluation budget.
Too small: you bake in human assumptions and NAS discovers nothing novel. Restricting the space to “ResNet-like blocks with 3x3 convs” just yields the best ResNet variant and misses entirely different designs.
The practical approach: cell-based search with 5-8 operation types and 2-4 intermediate nodes per cell. This gives roughly 10^9 candidates, which is large enough for novelty and small enough for DARTS to search in a few GPU-days. The macro architecture follows a human-designed stacking template while the cell internals are fully searched.
Follow-up: How do you validate a NAS-found architecture is genuinely better?
Always retrain from scratch on the full dataset, because DARTS-style supernet weights may not reflect independent training performance. Protocol: (1) run NAS on a proxy task (smaller data, fewer epochs), (2) train the discovered architecture from scratch with the full recipe, (3) compare against baselines trained identically. If the NAS architecture does not outperform the baselines under identical conditions, the search was not useful. Verify across multiple random seeds, since NAS results have high variance.
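The size estimate in the answer above can be reproduced under one common cell convention. The sketch assumes 7 candidate operations, 2 cell inputs, 4 intermediate nodes, and exactly 2 incoming edges per node (a DARTS-style layout). The operation names, `Edge`, `sample_cell`, and `count_cells` are illustrative and do not come from the original page.

```python
import math
import random
from dataclasses import dataclass

# Hypothetical operation set for a cell-based search space.
OPS = ["sep_conv_3x3", "sep_conv_5x5", "dil_conv_3x3", "dil_conv_5x5",
       "max_pool_3x3", "avg_pool_3x3", "skip_connect"]

NUM_INPUTS = 2        # each cell sees the outputs of the two previous cells
NUM_INTERMEDIATE = 4  # intermediate nodes per cell
EDGES_PER_NODE = 2    # each intermediate node keeps two incoming edges


@dataclass(frozen=True)
class Edge:
    source: int  # index of the earlier node feeding this edge
    op: str      # operation applied along the edge


def sample_cell(rng):
    """Draw one cell at random: a pair of edges for every intermediate node."""
    cell = []
    for node in range(NUM_INTERMEDIATE):
        num_predecessors = NUM_INPUTS + node
        sources = sorted(rng.sample(range(num_predecessors), EDGES_PER_NODE))
        cell.append(tuple(Edge(src, rng.choice(OPS)) for src in sources))
    return tuple(cell)


def count_cells():
    """Choices per node: which predecessors to use, times an op per edge."""
    total = 1
    for node in range(NUM_INTERMEDIATE):
        num_predecessors = NUM_INPUTS + node
        total *= (math.comb(num_predecessors, EDGES_PER_NODE)
                  * len(OPS) ** EDGES_PER_NODE)
    return total


cell = sample_cell(random.Random(0))
print(cell[0])
print(f"{count_cells():,} distinct cells")  # 1,037,664,180 distinct cells
```

Adding one more operation or node changes the count by a large factor, which is why small changes to the space definition have an outsized effect on search cost.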
What is the one-shot NAS approach? What is its main failure mode?
Strong Answer:
One-shot NAS trains a single supernet containing all candidate architectures as sub-networks with shared weights. After training once, candidate architectures are evaluated by extracting their sub-network, so evaluation costs only inference on the validation data, with no further training.
Weight sharing works because supernet weights, trained across many configurations, encode broadly useful feature extractors. The ranking of sub-networks within the supernet should correlate with the ranking of the same architectures trained independently.
The main failure mode is rank correlation breakdown. Shared weights create interference: the weights are a compromise across many configurations and optimal for none. A sub-network that performs well with shared weights might underperform when trained independently. The Kendall tau correlation between supernet and independent rankings is often only 0.5-0.7.
Mitigations: train the supernet longer, use progressive shrinking, or use the supernet ranking as a coarse filter and then train the top 5-10 candidates from scratch.
Follow-up: How does predictor-based NAS compare?
Predictor-based NAS trains a surrogate model (MLP or GNN) to predict performance from an architecture encoding. Train 200-500 seed architectures, fit the predictor, then score thousands of candidates in microseconds. This avoids weight-sharing interference entirely and typically achieves higher rank correlation. The “train a few, predict the rest” approach is increasingly popular for its simplicity, reliability, and parallelizability.
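The rank correlation mentioned in this answer can be measured directly. Below is a small sketch of Kendall tau (the tau-a variant, which assumes no ties), applied to made-up accuracy numbers for five architectures. Both lists are hypothetical and chosen only to show the calculation.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total pairs. Assumes no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        direction = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if direction > 0:
            concordant += 1   # both rankings order this pair the same way
        elif direction < 0:
            discordant += 1   # the rankings disagree on this pair
    pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical accuracies for the same five architectures.
supernet_acc = [0.61, 0.58, 0.66, 0.52, 0.63]    # scored with shared weights
standalone_acc = [0.93, 0.94, 0.95, 0.90, 0.92]  # trained from scratch

print(kendall_tau(supernet_acc, standalone_acc))  # 0.4
```

A value of 1.0 means the supernet orders every pair exactly as standalone training does, and 0 means no agreement beyond chance. In this made-up example 7 of the 10 pairs agree, giving 0.4.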