The Evolution of CNN Architectures
Timeline of Innovation
The history of CNN architectures reads like a series of clever solutions to a single question: how do you build deeper, more capable networks without running into vanishing gradients, computational blowup, or overfitting? Each architecture below introduced a key insight that changed how practitioners think about network design.
VGGNet: Simplicity and Depth
Design Philosophy
VGGNet proved that depth with small filters beats shallow networks with large filters. The brilliance of VGGNet is its simplicity: one design pattern repeated uniformly throughout the network. This made VGG the first architecture that felt “principled” rather than hand-tuned.
Think of VGG like a well-run factory assembly line: every station does the same type of work (3x3 convolution), and the product (feature map) gets progressively refined as it moves through. No special-purpose machinery, no branching paths, just disciplined repetition. This uniformity made VGG the go-to backbone for transfer learning for years, precisely because its behavior was so predictable.
Key insights:
- Use only 3x3 convolutions: two stacked 3x3 convolutions have the same receptive field as one 5x5, but with fewer parameters and more non-linearities
- Double channels when halving spatial dimensions: as spatial resolution decreases (via pooling), increase representational capacity (via more channels)
- Simple, uniform architecture: no special cases or complex branching, making it easy to understand and modify
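The "double channels when halving spatial dimensions" rule can be traced layer by layer. The sketch below assumes the commonly cited VGG-16 convolutional layer plan and a 224x224 RGB input, neither of which is listed in this document; trace_vgg is a hypothetical helper, not part of any library. Note that in this plan the channel count stops doubling once it reaches 512.

```python
# Layer plan for the convolutional part of VGG-16 (assumed, see lead-in):
# integers are 3x3 conv output channels, "M" is a 2x2 max-pool with stride 2.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def trace_vgg(cfg, in_channels=3, size=224):
    """Return (channels, spatial size, conv parameters) after walking the plan."""
    channels, params = in_channels, 0
    for layer in cfg:
        if layer == "M":
            size //= 2  # pooling halves height and width
        else:
            # A 3x3 conv with padding 1 keeps the spatial size unchanged.
            params += 3 * 3 * channels * layer + layer  # weights + biases
            channels = layer
    return channels, size, params


channels, size, params = trace_vgg(VGG16_CFG)
print(channels, size, params)
```

Every convolution in the plan has the same shape of work; only the channel count and the pooling positions change, which is what makes the design easy to modify.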
Why 3×3 Filters?
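Two stacked 3×3 convolutions cover the same 5×5 receptive field as a single 5×5 convolution, but with:
- Fewer parameters: $2 \times (3 \times 3 \times C^2) = 18C^2$ vs $5 \times 5 \times C^2 = 25C^2$ weights, for $C$ input and output channels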
- More non-linearities: 2 ReLU activations vs 1
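- Better regularization: the stack can only represent functions that decompose into two 3×3 filters, a more structured model than an unconstrained 5×5 filter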
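The comparison above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming stride-1 convolutions, equal input and output channel counts, and no biases; the helper names and the channel count of 64 are illustrative.

```python
def conv_weights(kernel, channels):
    """Weights in one kernel x kernel conv with `channels` in and out (no bias)."""
    return kernel * kernel * channels * channels


def stacked_receptive_field(kernel, num_layers):
    """Receptive field of `num_layers` stacked stride-1 convs of size `kernel`."""
    return 1 + num_layers * (kernel - 1)


C = 64  # illustrative channel count
two_3x3 = 2 * conv_weights(3, C)
one_5x5 = conv_weights(5, C)

print(stacked_receptive_field(3, 2))  # 5
print(two_3x3, one_5x5)               # 73728 102400
```

The ratio is 18/25 regardless of the channel count, so the stacked version always uses 28% fewer weights. Three stacked 3×3 layers reach a 7×7 receptive field by the same formula.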
GoogLeNet/Inception: Multi-Scale Processing
The Inception Module
The key question GoogLeNet asked: why choose between a 1x1, 3x3, or 5x5 convolution when you can use ALL of them and let the network decide which scale matters? The Inception module processes input at multiple scales simultaneously and concatenates the results. Think of it as giving the network multiple “magnifying glasses” of different power at every layer.
Inception V2/V3: Factorized Convolutions
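A rough sketch of what factorizing a convolution saves. It assumes the asymmetric factorization usually associated with Inception V2/V3, where an n x n convolution is replaced by a 1 x n convolution followed by an n x 1 convolution. The helper names and the channel count are illustrative, and biases are ignored.

```python
def conv_weights(kh, kw, c_in, c_out):
    """Weights in a kh x kw convolution (bias ignored)."""
    return kh * kw * c_in * c_out


def asymmetric_factorized(n, channels):
    """A 1 x n conv followed by an n x 1 conv, channel count kept constant."""
    return (conv_weights(1, n, channels, channels)
            + conv_weights(n, 1, channels, channels))


C = 128  # illustrative channel count
full_7x7 = conv_weights(7, 7, C, C)
fact_7x7 = asymmetric_factorized(7, C)

print(full_7x7, fact_7x7)
print(fact_7x7 / full_7x7)  # the fraction of weights that remains, 2/n
```

The factorized pair covers the same n x n window with 2n weights per channel pair instead of n squared, so the saving grows with kernel size.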
ResNet: Skip Connections
The Residual Block
ResNet is arguably the single most important architectural innovation in deep learning after the original convolutional neural network. The insight is counterintuitive but profound: instead of asking each block to learn a complete transformation $h(x)$, ask it to learn only the difference from the identity function, $F(x) = h(x) - x$.
Why does this work? If the optimal transformation at some layer is close to the identity (just pass the input through), it is much easier for the network to learn small residual adjustments around zero than to learn the identity mapping from scratch. The skip connection also creates a “gradient highway” that lets gradients flow directly to early layers without passing through many nonlinearities.
Instead of learning $h(x)$, learn the residual $F(x) = h(x) - x$, so that the block outputs $F(x) + x$.
Pre-Activation ResNet
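A toy comparison of the two block orderings, assuming the usual description of pre-activation ResNet: the activation (and normalization) moves before each convolution, so the shortcut path stays a pure identity. Scalars stand in for feature maps, w1 and w2 stand in for convolution weights, and batch normalization is omitted. All function names are illustrative.

```python
def relu(v):
    return max(0.0, v)


def original_block(x, w1, w2):
    """Original ordering: F(x) = w2 * relu(w1 * x), then relu after the addition."""
    return relu(x + w2 * relu(w1 * x))


def pre_activation_block(x, w1, w2):
    """Pre-activation ordering: relu comes before each weight, nothing after the add."""
    return x + w2 * relu(w1 * relu(x))


x = -1.5
# With the residual branch zeroed out, only the shortcut path matters.
print(original_block(x, 0.0, 0.0))        # 0.0
print(pre_activation_block(x, 0.0, 0.0))  # -1.5
```

With a zero residual branch, the pre-activation block returns its input unchanged, while the original ordering still clips negative values through the final ReLU. This is the sense in which the shortcut becomes an exact identity path.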
DenseNet: Feature Reuse
Dense Connections
Where ResNet adds a single shortcut from a block’s input to its output, DenseNet takes the idea to its logical extreme: every layer receives the outputs of all preceding layers within a block. Think of it like a group chat where every participant has heard everything everyone else has said: no information is ever lost or forgotten. This maximal connectivity means early features (edges, textures) are directly available to late layers (which detect objects), so the network never has to “re-learn” low-level features it already computed.
The practical upshot: DenseNet achieves competitive accuracy with far fewer parameters than ResNet, because feature reuse eliminates redundancy. The downside is higher memory consumption during training due to all those concatenated feature maps.
Each layer receives input from ALL preceding layers: $x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}])$, where $[\cdot]$ denotes channel-wise concatenation.
EfficientNet: Compound Scaling
The Scaling Problem
Before EfficientNet, practitioners scaled CNNs by independently increasing one dimension at a time: making the network deeper (more layers), wider (more channels), or feeding it higher-resolution images. But these dimensions are not independent. A wider network can capture more fine-grained patterns, but only if the resolution is high enough to contain them. A deeper network can build richer hierarchies, but only if the width provides enough capacity at each level. Scaling one dimension while ignoring others hits diminishing returns quickly.
EfficientNet’s key insight, from Mingxing Tan and Quoc Le at Google (2019), is compound scaling: scale all three dimensions together using a single coefficient. The analogy is straightforward: if you want a bigger picture, you do not just make it taller or wider; you scale both dimensions proportionally.
- Width: More channels per layer
- Depth: More layers
- Resolution: Higher input resolution
EfficientNet scales all three together with compound coefficient $\phi$: depth $d = \alpha^\phi$, width $w = \beta^\phi$, resolution $r = \gamma^\phi$.
Subject to: $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$, with $\alpha \geq 1$, $\beta \geq 1$, $\gamma \geq 1$.
ResNeXt: Cardinality
Split-Transform-Merge
ResNeXt introduced a third scaling dimension beyond depth and width: cardinality, the number of independent transformation paths within a block. The analogy is a team of specialists versus a single generalist. Instead of one wide convolution that must handle everything, ResNeXt runs 32 narrower convolutions in parallel, each specializing in different feature patterns, then merges their outputs.
In practice, cardinality is implemented efficiently using grouped convolutions (the groups parameter in PyTorch), which partition input channels into groups processed independently. ResNeXt-50 (32x4d) comes close to ResNet-101 accuracy with roughly half the computation, because increasing cardinality is more parameter-efficient than increasing depth or width.
Increase “cardinality” (number of parallel paths) instead of depth/width:
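A small sketch of why grouped convolutions make cardinality cheap. It counts the weights in a bottleneck block, assuming a 256-channel block input, a bottleneck width of 64 for the single-path block, and 32 groups of 4 channels for the 32x4d variant. These widths are assumptions, not taken from this document. The helpers are illustrative and biases are ignored.

```python
def conv_weights(kernel, c_in, c_out, groups=1):
    """Weights in a kernel x kernel conv; each group sees c_in / groups inputs."""
    assert c_in % groups == 0 and c_out % groups == 0
    return kernel * kernel * (c_in // groups) * c_out


def bottleneck_weights(c_io, c_mid, groups=1):
    """1x1 reduce -> 3x3 (optionally grouped) -> 1x1 expand."""
    return (conv_weights(1, c_io, c_mid)
            + conv_weights(3, c_mid, c_mid, groups)
            + conv_weights(1, c_mid, c_io))


resnet_block = bottleneck_weights(256, 64)               # one path, width 64
resnext_block = bottleneck_weights(256, 128, groups=32)  # 32 paths, width 4 each

print(resnet_block, resnext_block)  # 69632 70144
```

At nearly the same weight budget, the grouped block has a middle layer twice as wide, split into 32 independent paths. In PyTorch the same partitioning is requested through the groups argument of torch.nn.Conv2d.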
Architecture Comparison
Exercises
Exercise 1: Implement SENet
Add Squeeze-and-Excitation to any architecture:
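As a starting point, the Squeeze-and-Excitation computation can be written without any framework. The sketch below uses nested Python lists for feature maps and plain weight matrices without biases. The function name and argument layout are assumptions for illustration; a real solution would wrap the same three steps in a module of your framework.

```python
import math


def se_block(feature_maps, w_reduce, w_expand):
    """Apply Squeeze-and-Excitation to a list of C feature maps (each H x W).

    w_reduce: (C // r) x C weights of the first fully connected layer.
    w_expand: C x (C // r) weights of the second fully connected layer.
    """
    # Squeeze: global average pooling, one scalar per channel.
    squeezed = [sum(map(sum, fmap)) / (len(fmap) * len(fmap[0]))
                for fmap in feature_maps]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives a gate in (0, 1) per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[v * g for v in row] for row in fmap]
              for fmap, g in zip(feature_maps, gates)]
    return scaled, gates


# Two 2x2 channels, reduction ratio r = 2, so the hidden layer has one unit.
maps = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
scaled, gates = se_block(maps, w_reduce=[[1.0, 1.0]], w_expand=[[0.0], [0.0]])
print(gates)  # [0.5, 0.5]
```

With the expand weights set to zero, every gate is sigmoid(0) = 0.5 and each channel is simply halved. Trained weights would instead amplify some channels and suppress others based on the pooled statistics.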
Exercise 2: Build RegNet
Implement RegNet with its simple design rules:
Exercise 3: Architecture Search
Run a simple grid search over architecture hyperparameters:
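One possible skeleton for the search loop, assuming two hyperparameters (depth and width), a parameter budget, and a user-supplied evaluate callback. The size estimate and toy_score are placeholders so the sketch runs on its own. In the exercise, evaluate would train a model and return its validation accuracy.

```python
from itertools import product


def grid_search(depths, widths, evaluate, max_params):
    """Try every (depth, width) pair under a parameter budget; return the best."""
    best = None
    for depth, width in product(depths, widths):
        # Rough size of `depth` stacked 3x3 convs with `width` channels each.
        params = depth * 3 * 3 * width * width
        if params > max_params:
            continue  # skip configurations over budget
        score = evaluate(depth, width)
        if best is None or score > best[0]:
            best = (score, depth, width, params)
    return best


def toy_score(depth, width):
    """Placeholder objective so the sketch runs; NOT a real accuracy measure."""
    return depth * width


result = grid_search([4, 8, 16], [32, 64, 128], toy_score, max_params=1_000_000)
print(result)  # (score, depth, width, params) of the best configuration in budget
```

Filtering by budget before calling evaluate matters in practice, because each evaluation is a full training run.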
What’s Next?
Sequence-to-Sequence
Encoder-decoder architectures for sequences
Self-Supervised Learning
Learn representations without labels
Interview Deep-Dive
Why did ResNet's skip connections enable training networks hundreds of layers deep, when VGG struggled beyond 19 layers?
Strong Answer:
VGG’s approach was “just stack more 3x3 convolutions.” This works up to a point, but VGG-19 was already showing diminishing returns, and experiments with deeper plain (non-residual) networks, such as the 20-layer versus 56-layer comparison in the ResNet paper, showed that adding more layers actually decreased accuracy. This “degradation problem” was surprising because a deeper network can always represent the same function as a shallower one. The problem is optimization, not representation.
Theoretically, ResNet’s skip connections transform the optimization landscape. Instead of each layer needing to learn a complete transformation h(x), it only needs to learn the residual F(x) = h(x) - x. Learning a residual is easier because the identity mapping is already the default. The loss landscape of residual networks has been shown to be much smoother: fewer local minima and saddle points, wider valleys around good solutions.
Practically, the gradient flow story is equally important. In VGG, gradients must pass through 19 weight-activation pairs, each potentially attenuating the signal. ResNet’s skip connections provide a direct gradient highway: the gradient to any layer includes a term that flows through the identity shortcuts without any multiplicative attenuation. Even if the residual branches have poor gradients in early training, the identity path keeps information flowing.
The ensemble interpretation adds another lens: a ResNet with N blocks can be viewed as an implicit ensemble of 2^N sub-networks (each corresponding to a different subset of residual blocks being “active”). This gives ResNets a built-in regularization effect similar to dropout.
Follow-up: If skip connections are so beneficial, why not connect every layer to every other layer?
That is exactly what DenseNet does, and it works well for moderate depths. DenseNet concatenates all previous layer outputs as input to each layer, maximizing feature reuse and gradient flow. However, it creates a memory problem: by layer L, the input is the concatenation of all L-1 previous feature maps, so memory usage grows quadratically. For very deep networks or high-resolution inputs, this becomes prohibitive. In practice, ResNet’s additive residual has won for most use cases because it is simple, memory-efficient, and scales to thousands of layers. DenseNet is preferred in domains like medical imaging where feature reuse is critical and training sets are small.
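The quadratic growth mentioned in the follow-up can be made concrete by counting the channels each layer of a dense block reads. This is a minimal sketch, assuming every layer adds a fixed number of channels (the growth rate) and that each concatenated input is materialized; the block sizes are illustrative.

```python
def dense_block_inputs(num_layers, growth_rate, in_channels):
    """Input channel count seen by each layer of a dense block."""
    return [in_channels + i * growth_rate for i in range(num_layers)]


short = dense_block_inputs(num_layers=12, growth_rate=32, in_channels=256)
long = dense_block_inputs(num_layers=24, growth_rate=32, in_channels=256)

print(long[0], long[-1])      # 256 992
print(sum(short), sum(long))  # 5184 14976
```

Per-layer input width grows linearly, so the total across the block grows with the square of the layer count: doubling the layers here nearly triples the channels read. An additive residual block keeps the width constant instead.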
Explain the EfficientNet compound scaling approach. Why is it better than scaling depth, width, or resolution independently?
Strong Answer:
EfficientNet’s key insight from Tan and Le (2019) is that network depth, width, and input resolution should be scaled together in a principled ratio. They formalized this with a compound scaling coefficient phi: depth scales as alpha^phi, width as beta^phi, and resolution as gamma^phi, where alpha * beta^2 * gamma^2 is approximately 2 (so each increase in phi roughly doubles the FLOPs).
Why scaling independently fails: if you only increase depth, you run into vanishing gradients and diminishing accuracy returns (ResNet-1001 is barely better than ResNet-152). If you only increase width, you have more channels but insufficient depth to compose features and resolution too low to represent fine details. If you only increase resolution, you have more spatial detail but not enough model capacity to process it. Each dimension has diminishing returns when scaled alone because the three are interdependent.
Intuitively, a higher resolution image contains more fine-grained detail, which requires a wider network to capture the increased information content and a deeper network to compose finer features into higher-level concepts. The practical result: EfficientNet-B7 achieved better accuracy than the best prior CNN while being 8.4x smaller and 6.1x faster.
Follow-up: EfficientNet was state-of-the-art in 2019. What replaced it and why?
Vision Transformers and their derivatives (DeiT, Swin) have largely overtaken EfficientNet on major benchmarks, especially with large-scale pretraining. Attention-based architectures have a fundamental advantage in modeling global dependencies from early layers, while CNNs grow their receptive field slowly with depth. However, EfficientNet and EfficientNetV2 remain competitive for edge deployment because transformers require more compute at smaller model sizes. ConvNeXt (2022) showed that a pure CNN borrowing design ideas from transformers (larger kernels, GELU, fewer norms) can match Swin Transformer, suggesting the gap is about design choices as much as attention itself.
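The compound scaling rule from the answer above can be checked numerically. The sketch assumes the base values alpha = 1.2, beta = 1.1, gamma = 1.15 reported in the EfficientNet paper, which are not stated in this document. The function name is illustrative.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases (assumed)


def compound_scale(phi):
    """Multipliers for depth, width, resolution, and the implied FLOPs factor."""
    depth, width, resolution = ALPHA ** phi, BETA ** phi, GAMMA ** phi
    # FLOPs grow linearly with depth and quadratically with width and resolution.
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPs x{f:.2f}")
```

With these bases, alpha * beta^2 * gamma^2 comes to about 1.92, so each step in phi roughly doubles the FLOPs while growing all three dimensions by modest, fixed ratios.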
You need to choose a backbone for a real-time object detection system on a mobile phone. Walk me through your architecture selection process.
Strong Answer:
The constraints define the decision: mobile phone means roughly 2-4 TOPS of neural engine compute, under 50MB model storage, under 20ms inference for 30 FPS real-time, and battery efficiency.
Step one: define the accuracy floor with the product team. If mAP below 0.45 makes the feature useless, that eliminates the smallest models.
Step two: benchmark on the actual target hardware. Paper FLOPs and device latency are poorly correlated due to memory bandwidth, operator fusion, and hardware-specific optimizations. Profile MobileNetV2, MobileNetV3, EfficientNet-Lite B0-B2, and MNASNet using the platform’s benchmark tools (TFLite benchmark, Core ML). Measure wall-clock latency, not FLOPs.
Step three: plot the accuracy-latency Pareto frontier and pick the model meeting both accuracy floor and latency budget. MobileNetV3-Small or EfficientNet-Lite-B0 are often the sweet spots.
Step four: optimize aggressively. Quantize to INT8 (2-3x speedup, less than 1% accuracy loss on these architectures). Convert to platform-native format and enable hardware acceleration.
Step five: test edge cases the benchmarks miss (low light, motion blur, unusual angles, small objects), where mobile backbones fail first due to reduced capacity.
Follow-up: MobileNetV3-Small is 5ms over your latency budget after INT8 quantization. What are your options?
Three options in order of preference: (1) Reduce input resolution from 320x320 to 256x256, which has a quadratic effect on latency (roughly 0.64x) at the cost of small object detection. (2) Channel pruning to remove least-important channels, requiring a small labeled dataset for importance measurement. (3) Knowledge distillation to train a custom smaller architecture using MobileNetV3-Small as the teacher. Avoid switching to FP16 on mobile: on many neural engines, INT8 is already the fastest path and FP16 can be slower due to hardware-specific execution units.
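Step three of the answer above (pick from the accuracy-latency Pareto frontier) can be sketched as a small selection routine. The measurements below are placeholders with generic model names, not real benchmark results, and both function names are hypothetical. The last line reproduces the resolution arithmetic from the follow-up.

```python
def pareto_front(candidates):
    """Keep models that no other model beats on both latency and accuracy."""
    front = []
    for name, latency, acc in candidates:
        dominated = any(
            o_lat <= latency and o_acc >= acc and (o_lat < latency or o_acc > acc)
            for _, o_lat, o_acc in candidates
        )
        if not dominated:
            front.append((name, latency, acc))
    return front


def pick_backbone(candidates, max_latency_ms, min_accuracy):
    """Most accurate Pareto-optimal model meeting both constraints, or None."""
    ok = [c for c in pareto_front(candidates)
          if c[1] <= max_latency_ms and c[2] >= min_accuracy]
    return max(ok, key=lambda c: c[2], default=None)


# Placeholder measurements (name, latency in ms, mAP); not real benchmark data.
measured = [("model_a", 12.0, 0.41), ("model_b", 18.0, 0.47),
            ("model_c", 19.0, 0.46), ("model_d", 27.0, 0.52)]
print(pick_backbone(measured, max_latency_ms=20.0, min_accuracy=0.45))

# Compute scales with pixel count, so 320 -> 256 leaves this fraction of the work.
print((256 / 320) ** 2)
```

The latency column should come from wall-clock measurements on the target device, as step two describes; the selection logic stays the same whatever the numbers are.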