Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Neural Architecture Search

Neural Architecture Search: Automating Design

Why Automate Architecture Design?

Every neural network architecture you have learned about so far — ResNet, Inception, EfficientNet — was designed by humans through months or years of trial and error. The question NAS asks: can we automate this process and let machines design better networks than humans can? Designing neural networks manually is:
  • Time-consuming: Weeks or months of human experimentation for a single architecture
  • Expertise-dependent: Requires deep intuition that takes years to develop
  • Suboptimal: Humans can only explore a tiny fraction of the possible design space
  • Biased: We tend to design architectures similar to ones we have seen before
NAS algorithms can:
  • Explore vast architecture spaces systematically — millions of candidates evaluated
  • Find novel, non-intuitive designs — NAS-discovered cells often look nothing like human designs
  • Optimize for specific hardware constraints — latency on a particular phone chip, memory budget
  • Discover architectures that outperform human designs — EfficientNet and MNASNet were NAS-discovered
The honest trade-off: NAS is computationally expensive. The original NAS paper by Zoph and Le (2017) used 800 GPUs for 28 days. While modern methods like DARTS and one-shot NAS have reduced this to single-GPU budgets, NAS is still best suited for high-impact architecture decisions where the found architecture will be deployed at massive scale (e.g., models that run on millions of phones).
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import random
from typing import List, Dict, Tuple, Optional
from collections import namedtuple

torch.manual_seed(42)

NAS Components

1. Search Space

Define what architectures are possible:
# Operation choices for each edge
OPERATIONS = {
    'none': lambda C, stride: Zero(stride),
    'skip_connect': lambda C, stride: Identity() if stride == 1 else FactorizedReduce(C, C),
    'sep_conv_3x3': lambda C, stride: SepConv(C, C, 3, stride, 1),
    'sep_conv_5x5': lambda C, stride: SepConv(C, C, 5, stride, 2),
    'dil_conv_3x3': lambda C, stride: DilConv(C, C, 3, stride, 2, 2),
    'dil_conv_5x5': lambda C, stride: DilConv(C, C, 5, stride, 4, 2),
    'avg_pool_3x3': lambda C, stride: nn.AvgPool2d(3, stride=stride, padding=1),
    'max_pool_3x3': lambda C, stride: nn.MaxPool2d(3, stride=stride, padding=1),
}


class SepConv(nn.Module):
    """Separable convolution."""
    
    def __init__(self, C_in, C_out, kernel_size, stride, padding):
        super().__init__()
        self.op = nn.Sequential(
            nn.ReLU(inplace=False),
            nn.Conv2d(C_in, C_in, kernel_size, stride, padding, groups=C_in, bias=False),
            nn.Conv2d(C_in, C_out, 1, bias=False),
            nn.BatchNorm2d(C_out),
            nn.ReLU(inplace=False),
            nn.Conv2d(C_out, C_out, kernel_size, 1, padding, groups=C_out, bias=False),
            nn.Conv2d(C_out, C_out, 1, bias=False),
            nn.BatchNorm2d(C_out)
        )
    
    def forward(self, x):
        return self.op(x)


class DilConv(nn.Module):
    """Dilated convolution."""
    
    def __init__(self, C_in, C_out, kernel_size, stride, padding, dilation):
        super().__init__()
        self.op = nn.Sequential(
            nn.ReLU(inplace=False),
            nn.Conv2d(C_in, C_in, kernel_size, stride, padding, dilation=dilation, groups=C_in, bias=False),
            nn.Conv2d(C_in, C_out, 1, bias=False),
            nn.BatchNorm2d(C_out)
        )
    
    def forward(self, x):
        return self.op(x)


class Identity(nn.Module):
    def forward(self, x):
        return x


class Zero(nn.Module):
    def __init__(self, stride):
        super().__init__()
        self.stride = stride
    
    def forward(self, x):
        if self.stride == 1:
            return x * 0
        return x[:, :, ::self.stride, ::self.stride] * 0


class FactorizedReduce(nn.Module):
    """Reduce spatial size while maintaining channels."""
    
    def __init__(self, C_in, C_out):
        super().__init__()
        self.conv1 = nn.Conv2d(C_in, C_out // 2, 1, stride=2, bias=False)
        self.conv2 = nn.Conv2d(C_in, C_out // 2, 1, stride=2, bias=False)
        self.bn = nn.BatchNorm2d(C_out)
    
    def forward(self, x):
        out = torch.cat([self.conv1(x), self.conv2(x[:, :, 1:, 1:])], dim=1)
        return self.bn(out)

2. Search Strategy

How to explore the search space:
class SearchStrategy:
    """Base class for NAS search strategies."""
    
    def __init__(self, search_space: Dict):
        self.search_space = search_space
    
    def sample_architecture(self) -> Dict:
        """Sample a random architecture from the search space."""
        raise NotImplementedError
    
    def update(self, architecture: Dict, performance: float):
        """Update search strategy based on evaluation."""
        raise NotImplementedError

3. Performance Estimation

How to evaluate candidate architectures efficiently:
class PerformanceEstimator:
    """Base class for performance estimation strategies."""
    
    def estimate(self, architecture: Dict, dataset) -> float:
        """Estimate performance of an architecture."""
        raise NotImplementedError


class FullTrainingEstimator(PerformanceEstimator):
    """Train model fully - slow but accurate."""
    
    def estimate(self, architecture, dataset, epochs=100):
        model = build_model(architecture)
        train(model, dataset, epochs)
        return evaluate(model, dataset)


class EarlyStoppingEstimator(PerformanceEstimator):
    """Train for few epochs - faster but less accurate."""
    
    def estimate(self, architecture, dataset, epochs=10):
        model = build_model(architecture)
        train(model, dataset, epochs)
        return evaluate(model, dataset)

DARTS was a breakthrough because it made architecture search as fast as training a single model. The trick: instead of discretely choosing one operation per edge (which requires evaluating exponentially many combinations), relax the choice to a continuous weighted sum of ALL operations. Then you can optimize both the architecture weights and the network weights using standard gradient descent. Think of it as a restaurant where you order every item on the menu simultaneously, then gradually increase the weight on the dishes you like and decrease the weight on the ones you do not. At the end, you just keep the highest-weighted dish for each course.

The DARTS Cell

class MixedOp(nn.Module):
    """Mixed operation with learnable architecture weights."""
    
    def __init__(self, C: int, stride: int):
        super().__init__()
        self._ops = nn.ModuleList()
        
        for name, op_fn in OPERATIONS.items():
            op = op_fn(C, stride)
            self._ops.append(op)
    
    def forward(self, x: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
        """
        Weighted sum of all operations.
        weights: softmax of architecture parameters
        """
        return sum(w * op(x) for w, op in zip(weights, self._ops))


class DARTSCell(nn.Module):
    """
    DARTS cell with learnable architecture.
    
    A cell is a DAG of nodes where:
    - Node 0, 1: input nodes (from previous cells)
    - Node 2, 3, ...: intermediate nodes
    - Output: concatenation of all intermediate nodes
    """
    
    def __init__(self, num_nodes: int, C: int, reduction: bool):
        super().__init__()
        
        self.num_nodes = num_nodes
        self.reduction = reduction
        
        self._ops = nn.ModuleList()
        
        # Create mixed ops for each edge
        for i in range(num_nodes):
            for j in range(i + 2):  # Connect to all previous nodes
                stride = 2 if reduction and j < 2 else 1
                self._ops.append(MixedOp(C, stride))
        
        # Preprocessing for input nodes
        self.preprocess0 = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(C, C, 1, bias=False),
            nn.BatchNorm2d(C)
        )
        self.preprocess1 = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(C, C, 1, bias=False),
            nn.BatchNorm2d(C)
        )
    
    def forward(
        self, 
        s0: torch.Tensor, 
        s1: torch.Tensor, 
        weights: torch.Tensor
    ) -> torch.Tensor:
        """
        Args:
            s0: Output from cell k-2
            s1: Output from cell k-1
            weights: Architecture weights [num_edges, num_ops]
        """
        s0 = self.preprocess0(s0)
        s1 = self.preprocess1(s1)
        
        states = [s0, s1]
        
        offset = 0
        for i in range(self.num_nodes):
            # Sum weighted outputs from all previous nodes
            s = sum(
                self._ops[offset + j](states[j], weights[offset + j])
                for j in range(i + 2)
            )
            offset += i + 2
            states.append(s)
        
        # Concatenate all intermediate nodes
        return torch.cat(states[2:], dim=1)


class DARTSNetwork(nn.Module):
    """Complete DARTS searchable network."""
    
    def __init__(
        self,
        C: int = 16,
        num_classes: int = 10,
        num_layers: int = 8,
        num_nodes: int = 4,
        num_ops: int = 8
    ):
        super().__init__()
        
        self.num_layers = num_layers
        self.num_nodes = num_nodes
        
        # Initial stem
        self.stem = nn.Sequential(
            nn.Conv2d(3, C, 3, padding=1, bias=False),
            nn.BatchNorm2d(C)
        )
        
        # Build cells
        self.cells = nn.ModuleList()
        
        C_curr = C
        reduction_layers = [num_layers // 3, 2 * num_layers // 3]
        
        for i in range(num_layers):
            reduction = i in reduction_layers
            cell = DARTSCell(num_nodes, C_curr, reduction)
            self.cells.append(cell)
            
            if reduction:
                C_curr *= 2
        
        # Classifier
        self.classifier = nn.Linear(C_curr * num_nodes, num_classes)
        
        # Architecture parameters
        num_edges = sum(i + 2 for i in range(num_nodes))
        self.arch_params_normal = nn.Parameter(torch.randn(num_edges, num_ops) * 1e-3)
        self.arch_params_reduce = nn.Parameter(torch.randn(num_edges, num_ops) * 1e-3)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s0 = s1 = self.stem(x)
        
        for cell in self.cells:
            if cell.reduction:
                weights = F.softmax(self.arch_params_reduce, dim=-1)
            else:
                weights = F.softmax(self.arch_params_normal, dim=-1)
            
            s0, s1 = s1, cell(s0, s1, weights)
        
        out = F.adaptive_avg_pool2d(s1, 1)
        out = out.view(out.size(0), -1)
        return self.classifier(out)
    
    def arch_parameters(self):
        return [self.arch_params_normal, self.arch_params_reduce]
    
    def model_parameters(self):
        return [p for n, p in self.named_parameters() if 'arch_params' not in n]


class DARTSTrainer:
    """Bi-level optimization for DARTS.
    
    The key insight: architecture search is a bilevel optimization problem.
    - Outer level: optimize architecture weights on VALIDATION data
    - Inner level: optimize network weights on TRAINING data
    
    Using validation data for architecture selection prevents the architecture
    from overfitting to the training set (e.g., choosing memorization-friendly
    operations like large fully-connected layers).
    """
    
    def __init__(
        self,
        model: DARTSNetwork,
        train_loader,
        val_loader,
        lr_model: float = 0.025,
        lr_arch: float = 3e-4
    ):
        self.model = model
        self.train_loader = train_loader
        self.val_loader = val_loader
        
        # Two separate optimizers for the two levels of optimization
        # Network weights: SGD with momentum (standard for CNNs)
        self.optimizer_model = torch.optim.SGD(
            model.model_parameters(),
            lr=lr_model,
            momentum=0.9,
            weight_decay=3e-4
        )
        
        # Architecture weights: Adam (works well for the soft attention-like params)
        self.optimizer_arch = torch.optim.Adam(
            model.arch_parameters(),
            lr=lr_arch,
            weight_decay=1e-3
        )
    
    def train_epoch(self):
        """One epoch of bi-level optimization."""
        
        self.model.train()
        train_iter = iter(self.train_loader)
        val_iter = iter(self.val_loader)
        
        for step in range(len(self.train_loader)):
            # Get training batch
            try:
                train_x, train_y = next(train_iter)
            except StopIteration:
                train_iter = iter(self.train_loader)
                train_x, train_y = next(train_iter)
            
            # Get validation batch
            try:
                val_x, val_y = next(val_iter)
            except StopIteration:
                val_iter = iter(self.val_loader)
                val_x, val_y = next(val_iter)
            
            train_x, train_y = train_x.cuda(), train_y.cuda()
            val_x, val_y = val_x.cuda(), val_y.cuda()
            
            # Update architecture parameters on validation data
            self.optimizer_arch.zero_grad()
            val_loss = F.cross_entropy(self.model(val_x), val_y)
            val_loss.backward()
            self.optimizer_arch.step()
            
            # Update model parameters on training data
            self.optimizer_model.zero_grad()
            train_loss = F.cross_entropy(self.model(train_x), train_y)
            train_loss.backward()
            self.optimizer_model.step()
    
    def derive_architecture(self) -> Dict:
        """Extract discrete architecture from continuous weights."""
        
        def parse_cell(weights):
            """Select top-2 operations for each node."""
            gene = []
            
            offset = 0
            for i in range(self.model.num_nodes):
                # Get weights for edges to this node
                node_weights = weights[offset:offset + i + 2]
                
                # Select top-2 edges
                edge_scores = node_weights.max(dim=-1)[0]
                top2 = edge_scores.topk(2)[1]
                
                for j in top2:
                    op_idx = node_weights[j].argmax().item()
                    op_name = list(OPERATIONS.keys())[op_idx]
                    gene.append((op_name, j.item()))
                
                offset += i + 2
            
            return gene
        
        with torch.no_grad():
            normal = parse_cell(F.softmax(self.model.arch_params_normal, dim=-1))
            reduce = parse_cell(F.softmax(self.model.arch_params_reduce, dim=-1))
        
        return {'normal': normal, 'reduce': reduce}

Evolutionary NAS takes inspiration from biological evolution: maintain a population of architectures, evaluate their “fitness” (accuracy), let the best ones reproduce (via mutation), and kill off the weakest. Google’s AmoebaNet, discovered through evolutionary search, achieved state-of-the-art ImageNet accuracy in 2018. The approach is embarrassingly parallel — each candidate can be evaluated on a separate GPU — making it a natural fit for large compute clusters.
@dataclass
class Individual:
    """An individual in the population (an architecture)."""
    architecture: Dict
    fitness: float = 0.0
    age: int = 0


class EvolutionaryNAS:
    """Regularized evolution for architecture search."""
    
    def __init__(
        self,
        search_space: Dict,
        population_size: int = 100,
        tournament_size: int = 10,
        mutation_prob: float = 0.1
    ):
        self.search_space = search_space
        self.population_size = population_size
        self.tournament_size = tournament_size
        self.mutation_prob = mutation_prob
        
        self.population: List[Individual] = []
        self.history: List[Individual] = []
    
    def initialize_population(self):
        """Create initial random population."""
        for _ in range(self.population_size):
            arch = self.sample_random_architecture()
            self.population.append(Individual(architecture=arch))
    
    def sample_random_architecture(self) -> Dict:
        """Sample a random architecture from search space."""
        arch = {}
        for key, choices in self.search_space.items():
            arch[key] = random.choice(choices)
        return arch
    
    def mutate(self, parent: Dict) -> Dict:
        """Mutate an architecture."""
        child = parent.copy()
        
        # Randomly select one gene to mutate
        key = random.choice(list(self.search_space.keys()))
        choices = self.search_space[key]
        child[key] = random.choice(choices)
        
        return child
    
    def tournament_select(self) -> Individual:
        """Select parent via tournament selection."""
        candidates = random.sample(self.population, self.tournament_size)
        return max(candidates, key=lambda x: x.fitness)
    
    def evolve(self, num_generations: int, evaluator: PerformanceEstimator, dataset):
        """Run evolutionary search."""
        
        # Initialize
        self.initialize_population()
        
        # Evaluate initial population
        for ind in self.population:
            ind.fitness = evaluator.estimate(ind.architecture, dataset)
        
        best_fitness = max(ind.fitness for ind in self.population)
        print(f"Generation 0: Best fitness = {best_fitness:.4f}")
        
        for gen in range(num_generations):
            # Select parent
            parent = self.tournament_select()
            
            # Mutate to create child
            child_arch = self.mutate(parent.architecture)
            
            # Evaluate child
            child_fitness = evaluator.estimate(child_arch, dataset)
            child = Individual(architecture=child_arch, fitness=child_fitness)
            
            # Add to population
            self.population.append(child)
            
            # Remove oldest individual (regularized evolution)
            self.population.sort(key=lambda x: x.age)
            oldest = self.population.pop(0)
            
            # Age all individuals
            for ind in self.population:
                ind.age += 1
            
            # Track history
            self.history.append(child)
            
            # Report
            if (gen + 1) % 50 == 0:
                best_fitness = max(ind.fitness for ind in self.population)
                print(f"Generation {gen+1}: Best fitness = {best_fitness:.4f}")
        
        # Return best
        return max(self.population, key=lambda x: x.fitness)


# Example search space
CELL_SEARCH_SPACE = {
    'op_0_0': list(OPERATIONS.keys()),
    'op_0_1': list(OPERATIONS.keys()),
    'op_1_0': list(OPERATIONS.keys()),
    'op_1_1': list(OPERATIONS.keys()),
    'op_1_2': list(OPERATIONS.keys()),
    'op_1_3': list(OPERATIONS.keys()),
    # ... more edges
}

Weight Sharing / One-Shot NAS

The key bottleneck in NAS is evaluation: training a candidate architecture to convergence takes hours or days. Weight sharing solves this by training a single “supernet” that contains all possible architectures simultaneously. Different architectures are just different subsets of the supernet’s weights. After training, you can evaluate any architecture in seconds by simply selecting the relevant weights — no retraining needed. The analogy: instead of building and test-driving a thousand different cars, build one car with every possible engine, every possible transmission, and every possible suspension. Then “test” different configurations by just activating the relevant components.
class SuperNet(nn.Module):
    """
    Supernet with all possible operations.
    Different architectures share weights.
    """
    
    def __init__(self, C: int = 16, num_classes: int = 10, num_layers: int = 8):
        super().__init__()
        
        self.num_layers = num_layers
        
        # Stem
        self.stem = nn.Sequential(
            nn.Conv2d(3, C, 3, padding=1, bias=False),
            nn.BatchNorm2d(C)
        )
        
        # Choice blocks
        self.layers = nn.ModuleList()
        
        for i in range(num_layers):
            stride = 2 if i in [num_layers // 3, 2 * num_layers // 3] else 1
            layer = ChoiceBlock(C, C if stride == 1 else C * 2, stride)
            self.layers.append(layer)
            
            if stride == 2:
                C *= 2
        
        # Classifier
        self.classifier = nn.Linear(C, num_classes)
    
    def forward(self, x: torch.Tensor, architecture: List[int]) -> torch.Tensor:
        """
        Forward with specific architecture.
        architecture: List of operation indices, one per layer.
        """
        x = self.stem(x)
        
        for layer, op_idx in zip(self.layers, architecture):
            x = layer(x, op_idx)
        
        x = F.adaptive_avg_pool2d(x, 1)
        x = x.view(x.size(0), -1)
        return self.classifier(x)
    
    def sample_architecture(self) -> List[int]:
        """Sample random architecture."""
        return [random.randint(0, len(OPERATIONS) - 1) for _ in range(self.num_layers)]


class ChoiceBlock(nn.Module):
    """Block with multiple operation choices."""
    
    def __init__(self, C_in: int, C_out: int, stride: int):
        super().__init__()
        
        self.ops = nn.ModuleList([
            op_fn(C_in, stride) for name, op_fn in OPERATIONS.items()
        ])
        
        # Project if needed
        self.proj = None
        if C_in != C_out or stride != 1:
            self.proj = nn.Sequential(
                nn.Conv2d(C_in, C_out, 1, stride=stride, bias=False),
                nn.BatchNorm2d(C_out)
            )
    
    def forward(self, x: torch.Tensor, op_idx: int) -> torch.Tensor:
        out = self.ops[op_idx](x)
        
        if self.proj is not None:
            out = self.proj(out)
        
        return out


class OneShotNAS:
    """One-shot NAS with weight sharing."""
    
    def __init__(
        self,
        supernet: SuperNet,
        train_loader,
        val_loader,
        lr: float = 0.025
    ):
        self.supernet = supernet
        self.train_loader = train_loader
        self.val_loader = val_loader
        
        self.optimizer = torch.optim.SGD(
            supernet.parameters(),
            lr=lr,
            momentum=0.9,
            weight_decay=3e-4
        )
    
    def train_supernet(self, epochs: int = 50):
        """Train supernet with random architecture sampling."""
        
        for epoch in range(epochs):
            self.supernet.train()
            
            for x, y in self.train_loader:
                x, y = x.cuda(), y.cuda()
                
                # Sample random architecture
                arch = self.supernet.sample_architecture()
                
                # Forward
                self.optimizer.zero_grad()
                out = self.supernet(x, arch)
                loss = F.cross_entropy(out, y)
                loss.backward()
                self.optimizer.step()
            
            # Evaluate random architecture
            val_acc = self.evaluate_architecture(self.supernet.sample_architecture())
            print(f"Epoch {epoch+1}: Val accuracy = {val_acc:.2f}%")
    
    def evaluate_architecture(self, architecture: List[int]) -> float:
        """Evaluate a specific architecture on validation set."""
        
        self.supernet.eval()
        correct = 0
        total = 0
        
        with torch.no_grad():
            for x, y in self.val_loader:
                x, y = x.cuda(), y.cuda()
                out = self.supernet(x, architecture)
                pred = out.argmax(dim=1)
                correct += (pred == y).sum().item()
                total += y.size(0)
        
        return 100 * correct / total
    
    def search(self, num_samples: int = 1000) -> Tuple[List[int], float]:
        """Random search over trained supernet."""
        
        best_arch = None
        best_acc = 0
        
        for _ in range(num_samples):
            arch = self.supernet.sample_architecture()
            acc = self.evaluate_architecture(arch)
            
            if acc > best_acc:
                best_arch = arch
                best_acc = acc
        
        return best_arch, best_acc

Hardware-Aware NAS

In production, accuracy alone is not enough. A model that achieves 80% accuracy in 5ms on a phone is often more valuable than one that achieves 82% accuracy in 500ms. Hardware-aware NAS adds latency, FLOPs, memory, and energy consumption as explicit optimization objectives alongside accuracy. Google’s MNASNet and EfficientNet families were discovered this way, specifically optimized for mobile inference latency.
class HardwareAwareNAS:
    """NAS with hardware constraints."""
    
    def __init__(
        self,
        search_space: Dict,
        target_latency_ms: float = 10.0,
        target_flops: float = 300e6
    ):
        self.search_space = search_space
        self.target_latency = target_latency_ms
        self.target_flops = target_flops
        
        # Latency lookup table (measured on target hardware)
        self.latency_lut = self._build_latency_lut()
    
    def _build_latency_lut(self) -> Dict:
        """Build latency lookup table for operations."""
        lut = {}
        for op_name in OPERATIONS.keys():
            # Measure latency for each operation
            lut[op_name] = self._measure_latency(op_name)
        return lut
    
    def _measure_latency(self, op_name: str, input_size: Tuple = (1, 64, 56, 56)) -> float:
        """Measure operation latency on target device."""
        import time
        
        op = OPERATIONS[op_name](64, 1)
        x = torch.randn(*input_size)
        
        # Warmup
        for _ in range(10):
            _ = op(x)
        
        # Measure
        times = []
        for _ in range(100):
            start = time.time()
            _ = op(x)
            times.append((time.time() - start) * 1000)  # ms
        
        return np.median(times)
    
    def estimate_latency(self, architecture: Dict) -> float:
        """Estimate total latency of architecture."""
        total = 0
        for key, op_name in architecture.items():
            total += self.latency_lut[op_name]
        return total
    
    def estimate_flops(self, architecture: Dict, input_size: Tuple = (224, 224)) -> float:
        """Estimate FLOPs of architecture."""
        # Use analytical formulas or libraries like fvcore
        pass
    
    def multi_objective_fitness(
        self,
        architecture: Dict,
        accuracy: float,
        alpha: float = 0.1,
        beta: float = 0.1
    ) -> float:
        """
        Multi-objective fitness: accuracy with latency/FLOPs penalty.
        """
        latency = self.estimate_latency(architecture)
        flops = self.estimate_flops(architecture) or 0
        
        # Soft penalty
        latency_penalty = max(0, latency - self.target_latency) / self.target_latency
        flops_penalty = max(0, flops - self.target_flops) / self.target_flops
        
        fitness = accuracy - alpha * latency_penalty - beta * flops_penalty
        
        return fitness


class ProxylessNAS:
    """
    ProxylessNAS: directly search on target task and hardware.
    Uses path-level binarization to reduce memory.
    """
    
    def __init__(self, supernet: SuperNet, latency_lut: Dict, target_latency: float):
        self.supernet = supernet
        self.latency_lut = latency_lut
        self.target_latency = target_latency
        
        # Architecture parameters
        self.arch_params = nn.ParameterList([
            nn.Parameter(torch.zeros(len(OPERATIONS)))
            for _ in range(supernet.num_layers)
        ])
    
    def sample_path(self) -> Tuple[List[int], torch.Tensor]:
        """Sample a single path (binary architecture)."""
        path = []
        log_probs = []
        
        for params in self.arch_params:
            probs = F.softmax(params, dim=0)
            dist = torch.distributions.Categorical(probs)
            idx = dist.sample()
            path.append(idx.item())
            log_probs.append(dist.log_prob(idx))
        
        return path, torch.stack(log_probs).sum()
    
    def latency_loss(self, path: List[int]) -> torch.Tensor:
        """Compute latency loss for path."""
        latency = sum(
            self.latency_lut[list(OPERATIONS.keys())[idx]]
            for idx in path
        )
        return F.relu(torch.tensor(latency - self.target_latency))

Practical NAS: Once-for-All

class OnceForAllNetwork(nn.Module):
    """
    Once-for-All: Train once, deploy everywhere.
    Supports dynamic depth, width, and resolution.
    """
    
    def __init__(
        self,
        base_channels: int = 64,
        max_layers: int = 12,
        width_mult_list: List[float] = [0.5, 0.75, 1.0],
        depth_list: List[int] = [2, 3, 4]
    ):
        super().__init__()
        
        self.max_layers = max_layers
        self.width_mult_list = width_mult_list
        self.depth_list = depth_list
        
        # Build maximum network
        self.layers = nn.ModuleList()
        C = base_channels
        
        for i in range(max_layers):
            # Each layer supports multiple widths
            layer = ElasticConvBlock(
                C, C,
                width_mult_list=width_mult_list
            )
            self.layers.append(layer)
        
        # Classifier supports multiple widths
        max_width = int(C * max(width_mult_list))
        self.classifier = nn.Linear(max_width, 1000)
    
    def forward(
        self,
        x: torch.Tensor,
        depth: int,
        width_mult: float
    ) -> torch.Tensor:
        """
        Forward with specific subnet configuration.
        """
        # Use only first `depth` layers
        for i in range(depth):
            x = self.layers[i](x, width_mult)
        
        x = F.adaptive_avg_pool2d(x, 1)
        x = x.view(x.size(0), -1)
        
        # Slice classifier weights for width
        out_features = int(self.classifier.in_features * width_mult)
        weight = self.classifier.weight[:, :out_features]
        x = F.linear(x[:, :out_features], weight, self.classifier.bias)
        
        return x
    
    def sample_subnet(self) -> Dict:
        """Sample a random subnet configuration."""
        return {
            'depth': random.choice(self.depth_list),
            'width_mult': random.choice(self.width_mult_list)
        }


class ElasticConvBlock(nn.Module):
    """Convolution block that supports elastic width."""
    
    def __init__(self, C_in: int, C_out: int, width_mult_list: List[float]):
        super().__init__()
        
        max_C_out = int(C_out * max(width_mult_list))
        
        self.conv = nn.Conv2d(C_in, max_C_out, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(max_C_out)
        self.relu = nn.ReLU(inplace=True)
    
    def forward(self, x: torch.Tensor, width_mult: float) -> torch.Tensor:
        out = self.conv(x)
        
        # Slice to active width
        active_C = int(self.conv.out_channels * width_mult)
        out = out[:, :active_C]
        
        out = self.bn(out)
        out = self.relu(out)
        
        return out

Exercises

Implement random search baseline:
def random_search_nas(search_space, evaluator, n_trials=1000):
    best_arch = None
    best_score = 0
    
    for _ in range(n_trials):
        arch = sample_random(search_space)
        score = evaluator(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    
    return best_arch
Train a performance predictor to speed up search:
class PerformancePredictor(nn.Module):
    def __init__(self, encoding_dim):
        # Encode architecture as vector
        # Predict accuracy from encoding
        pass
Implement Pareto-optimal architecture search:
def pareto_optimal_search(population, objectives):
    # objectives = ['accuracy', 'latency', 'params']
    # Find Pareto frontier
    # Return non-dominated architectures

What’s Next?

Interpretability

Understand what your models learn

Adversarial Robustness

Defend against adversarial attacks

Interview Deep-Dive

Strong Answer:Original NAS (Zoph and Le, 2017) used RL to train a controller that proposes architectures. Each candidate was trained from scratch and evaluated to compute reward. This required thousands of full training runs — 800 GPUs for 28 days, roughly $250K in compute.DARTS reformulates the discrete search as continuous optimization. Instead of choosing one operation per edge, DARTS places all candidates in parallel with learnable mixing weights, optimized via gradient descent alongside network weights. After search, the highest-weight operation on each edge is selected.This reduces cost from thousands of runs to approximately one run. DARTS finds competitive architectures in 1-4 GPU-days — a 1000x reduction.The trade-off: DARTS suffers from “collapse” where search converges to skip-connection-heavy architectures that are easy to optimize in the supernet but have low capacity. Skip connections have the smallest optimization footprint, biasing the search. Fixes include fair DARTS and progressive DARTS.Follow-up: When would you actually use NAS versus picking a known architecture?Two scenarios justify NAS cost. First, deploying to specific hardware with unusual constraints (particular mobile chip, FPGA) where NAS can optimize jointly for accuracy and hardware-specific latency. Second, designing an architecture family for massive-scale deployment where upfront search cost is amortized across billions of inferences. For a one-off project or small-scale deployment, picking EfficientNet or ResNet off the shelf is almost always correct.
Strong Answer:The search space defines what NAS can discover, typically at three levels: macro (layer count, topology), cell-level (operations within each repeated cell), and operation-level (specific convolution, pooling, or connection types).Too large: the algorithm wastes budget exploring terrible architectures and may never find the good region. The search becomes a needle-in-a-haystack problem. Degenerate architectures (all skip connections, extreme depths) consume evaluation budget.Too small: you bake in human assumptions and NAS discovers nothing novel. Restricting to “ResNet-like blocks with 3x3 convs” just yields the best ResNet variant, missing entirely different designs.The practical approach: cell-based search with 5-8 operation types and 2-4 intermediate nodes per cell. This gives roughly 10^9 candidates — large enough for novelty, small enough for DARTS to search in a few GPU-days. The macro architecture follows a human-designed stacking template while cell internals are fully searched.Follow-up: How do you validate a NAS-found architecture is genuinely better?Always retrain from scratch on the full dataset. DARTS-style supernet weights may not reflect independent training performance. Protocol: (1) run NAS on a proxy task (smaller data, fewer epochs), (2) train discovered architecture from scratch with full recipe, (3) compare against baselines trained identically. If NAS architecture does not outperform under identical conditions, the search was not useful. Verify across multiple random seeds since NAS results have high variance.
Strong Answer:One-shot NAS trains a single supernet containing all candidate architectures as sub-networks with shared weights. After training once, candidate architectures are evaluated by extracting their sub-network — essentially free evaluation via a single forward pass.Weight sharing works because supernet weights, trained across many configurations, encode broadly useful feature extractors. The sub-network ranking within the supernet should correlate with independently-trained rankings.The main failure mode is rank correlation breakdown. Shared weights create interference — the weights compromise across many configurations, optimal for none. A sub-network performing well with shared weights might underperform independently trained. Kendall tau correlation between supernet and independent rankings is often only 0.5-0.7.Mitigations: train the supernet longer, use progressive shrinking, or use supernet ranking as coarse filter then train the top 5-10 candidates from scratch.Follow-up: How does predictor-based NAS compare?Predictor-based NAS trains a surrogate model (MLP or GNN) to predict performance from architecture encoding. Train 200-500 seed architectures, fit the predictor, then score thousands of candidates in microseconds. This avoids weight-sharing interference entirely and typically achieves higher rank correlation. The “train a few, predict the rest” approach is increasingly popular for its simplicity, reliability, and parallelizability.