Pooling, Stride & CNN Design
From Building Blocks to Architectures
In the previous chapter, we learned about convolutions - the fundamental operation that makes CNNs powerful. Now we’ll explore how to combine these building blocks into effective architectures. Think of it like LEGO: knowing what a brick is doesn’t make you an architect. Understanding how to stack bricks into stable, beautiful structures does.

The Evolution of CNNs: From LeNet’s 5 layers in 1998 to modern architectures with 1000+ layers, CNN design has undergone a remarkable evolution. Each breakthrough taught us new principles about what makes networks learn better.
Deep Dive: Pooling Operations
Why Pooling Matters
Pooling serves three critical purposes:
- Dimensionality Reduction: Reduce computational cost and memory
- Translation Invariance: Small shifts in input don’t change output much
- Feature Summarization: Capture the “presence” of a feature, not its exact location
Max Pooling: The Dominant Choice
Max pooling takes the maximum value in each window - like asking “Is this feature present anywhere in this region?”
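A quick way to see this is to push a tiny tensor through `nn.MaxPool2d` (a minimal PyTorch sketch; the tensor values are invented for illustration):

```python
import torch
import torch.nn as nn

# A 1x1x4x4 input (batch, channels, height, width) with one strong activation.
x = torch.tensor([[[[0.1, 0.2, 0.0, 0.1],
                    [0.0, 5.0, 0.1, 0.2],
                    [0.3, 0.1, 0.0, 0.4],
                    [0.2, 0.0, 0.1, 0.3]]]])

pool = nn.MaxPool2d(kernel_size=2, stride=2)
y = pool(x)
print(y.shape)  # torch.Size([1, 1, 2, 2])
print(y)        # the 5.0 survives: the feature is "present" in its 2x2 region
```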
Average Pooling: Smooth Aggregation
Average pooling computes the mean of each window - useful when you care about the overall “intensity” of activations.
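The same tiny example through `nn.AvgPool2d` shows the smoothing effect (again a minimal sketch with invented values):

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[0.1, 0.2, 0.0, 0.1],
                    [0.0, 5.0, 0.1, 0.2],
                    [0.3, 0.1, 0.0, 0.4],
                    [0.2, 0.0, 0.1, 0.3]]]])

pool = nn.AvgPool2d(kernel_size=2, stride=2)
y = pool(x)
print(y)  # the lone 5.0 is averaged down to 1.325 in its window
```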
Global Average Pooling (GAP): The Modern Approach
Global Average Pooling reduces each feature map to a single value - replacing the large fully connected layers at the end of CNNs.
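In PyTorch this is typically `nn.AdaptiveAvgPool2d(1)` followed by a flatten and a single linear layer (a minimal classification-head sketch; the channel and class counts are arbitrary):

```python
import torch
import torch.nn as nn

# Hypothetical feature maps from a backbone: batch of 8, 512 channels, 7x7 spatial.
features = torch.randn(8, 512, 7, 7)

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # 8 x 512 x 1 x 1 -> one value per feature map
    nn.Flatten(),              # 8 x 512
    nn.Linear(512, 10),        # 8 x 10 class scores
)
print(head(features).shape)  # torch.Size([8, 10])
```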
Pooling Comparison Table
| Pooling Type | Formula | Best For | Downsides |
|---|---|---|---|
| Max Pooling | $y = \max_{(i,j) \in R} x_{ij}$ | Detecting presence of features | Loses spatial info, gradient flows to one element only |
| Average Pooling | $y = \frac{1}{k^2} \sum_{(i,j) \in R} x_{ij}$ for a $k \times k$ window $R$ | Smooth downsampling | May dilute strong activations |
| Global Average | $y_c = \frac{1}{HW} \sum_{i,j} x_{c,i,j}$ | Classification heads | Loses all spatial info |
| Strided Conv | Learned | Modern architectures | More parameters |
Understanding Stride
Stride as Downsampling
Stride controls how much the convolution kernel “jumps” between positions.
The Output Size Formula
For any convolutional or pooling layer with input size $I$, kernel size $K$, padding $P$, and stride $S$, the output size is

$$O = \left\lfloor \frac{I + 2P - K}{S} \right\rfloor + 1$$
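The formula is easy to check against a real layer (a small sketch; the sizes are arbitrary):

```python
import torch
import torch.nn as nn

def conv_output_size(i: int, k: int, p: int, s: int) -> int:
    """Output spatial size of a conv/pool layer: floor((I + 2P - K) / S) + 1."""
    return (i + 2 * p - k) // s + 1

# 224x224 input, 3x3 kernel, padding 1, stride 2 -> 112x112
print(conv_output_size(224, k=3, p=1, s=2))  # 112

# Verify against an actual PyTorch layer.
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1, stride=2)
x = torch.randn(1, 3, 224, 224)
print(conv(x).shape)  # torch.Size([1, 16, 112, 112])
```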
Stride vs Pooling for Downsampling
Modern architectures often debate: strided convolution or pooling?
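Both options halve the spatial resolution; the difference is that the strided convolution learns its downsampling filter, while pooling is fixed and parameter-free. A minimal side-by-side sketch:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)

# Fixed, parameter-free downsampling.
pool = nn.MaxPool2d(kernel_size=2, stride=2)

# Learned downsampling: the filter weights are trained with the rest of the network.
strided = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)

print(pool(x).shape)     # torch.Size([1, 64, 28, 28])
print(strided(x).shape)  # torch.Size([1, 64, 28, 28])
print(sum(p.numel() for p in strided.parameters()))  # 36928 extra parameters (64*64*9 + 64)
```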
Classic CNN Architectures
VGGNet: The Power of Depth
VGG’s key insight: Use many small (3×3) filters instead of a few large ones. Two 3×3 convolutions have the same receptive field as one 5×5, but with:
- Fewer parameters: $2 \cdot 3^2 C^2 = 18C^2$ vs $5^2 C^2 = 25C^2$ for $C$ input and output channels (verified in the sketch below)
- More non-linearity: Two ReLU activations instead of one
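The parameter counts are easy to verify in PyTorch (a small sketch with $C = 64$ and no bias terms, so the numbers match the formula exactly):

```python
import torch.nn as nn

C = 64

# One 5x5 convolution: 5*5*C*C = 25*C^2 weights.
one_5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2, bias=False)

# Two stacked 3x3 convolutions: 2*3*3*C*C = 18*C^2 weights, plus an extra ReLU in between.
two_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
)

def num_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

print(num_params(one_5x5))  # 102400 = 25 * 64**2
print(num_params(two_3x3))  # 73728  = 18 * 64**2
```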
ResNet: The Skip Connection Revolution
The Degradation Problem
As networks get deeper, they should perform at least as well as shallower ones (the extra layers could just learn identity). But in practice, deeper networks performed worse.
The Residual Connection
ResNet’s solution: Instead of learning the desired mapping $H(x)$ directly, learn the residual $F(x) = H(x) - x$. The block then outputs $F(x) + x$, so an identity mapping only requires pushing $F(x)$ toward zero.
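A minimal basic residual block in PyTorch (a sketch of the standard two-3×3-conv design; the 1×1 projection shortcut needed when shapes change is omitted here and appears in the bottleneck sketch below):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convs compute the residual F(x); the input is added back at the end."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection: F(x) + x

x = torch.randn(1, 64, 56, 56)
print(BasicBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```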
Bottleneck Block for Deeper Networks
For ResNet-50/101/152, bottleneck blocks reduce computation: a 1×1 convolution first shrinks the channel dimension, a 3×3 convolution works on the narrower representation, and a second 1×1 convolution expands it back.
Complete ResNet Implementation
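A minimal sketch of a bottleneck block plus a small ResNet-style network assembled from it (the stage and channel configuration here is illustrative, not the exact ResNet-50 layout):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, with a projection shortcut when shapes change."""
    expansion = 4

    def __init__(self, in_ch: int, mid_ch: int, stride: int = 1):
        super().__init__()
        out_ch = mid_ch * self.expansion
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Projection shortcut so the addition is shape-compatible.
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))


class TinyResNet(nn.Module):
    """Stem -> stacked bottleneck stages -> global average pooling -> linear classifier."""
    def __init__(self, blocks_per_stage=(2, 2, 2), num_classes: int = 10):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),
        )
        blocks, in_ch = [], 64
        for i, n_blocks in enumerate(blocks_per_stage):
            mid_ch = 64 * 2 ** i
            for j in range(n_blocks):
                # The first block of every stage after the first downsamples by striding.
                stride = 2 if (i > 0 and j == 0) else 1
                blocks.append(Bottleneck(in_ch, mid_ch, stride))
                in_ch = mid_ch * Bottleneck.expansion
        self.stages = nn.Sequential(*blocks)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(in_ch, num_classes)
        )

    def forward(self, x):
        return self.head(self.stages(self.stem(x)))


model = TinyResNet()
print(model(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 10])
```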
EfficientNet: Compound Scaling
The Scaling Problem
How do you scale a network when you have more compute? Previous approaches:
- Width scaling: More channels (WideResNet)
- Depth scaling: More layers (ResNet-152)
- Resolution scaling: Higher input resolution
Scaling Coefficients
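EfficientNet ties the three together with a single compound coefficient $\phi$: depth scales as $\alpha^\phi$, width as $\beta^\phi$, and input resolution as $\gamma^\phi$, with $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ so that FLOPs grow by roughly $2^\phi$. A minimal sketch of this bookkeeping (the base coefficients $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$ are the values reported in the EfficientNet paper; the base depth/width/resolution and the rounding scheme here are illustrative placeholders):

```python
# Compound scaling: depth, width, and resolution all grow with one coefficient phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # alpha * beta^2 * gamma^2 is approximately 2

def compound_scale(phi: int, base_depth: int = 18, base_width: int = 32, base_res: int = 224):
    depth = round(base_depth * ALPHA ** phi)   # number of layers
    width = round(base_width * BETA ** phi)    # number of channels
    res = round(base_res * GAMMA ** phi)       # input resolution
    return depth, width, res

for phi in range(4):
    print(phi, compound_scale(phi))
```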
Modern CNN Design Principles
Principle 1: Use Small Filters
Principle 2: Batch Normalization Everywhere
Principle 3: Skip Connections for Deep Networks
Principle 4: Global Average Pooling for Classification
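These four principles combine naturally into a single block pattern that most modern CNNs repeat: small 3×3 filters, BatchNorm after every convolution, a skip connection around each block, and a GAP head instead of large fully connected layers. A minimal sketch putting them together (the channel widths and input size are arbitrary):

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, stride=1):
    # Principles 1 and 2: 3x3 filters with BatchNorm after every convolution.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class ModernBlock(nn.Module):
    # Principle 3: a skip connection around every pair of convolutions.
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(conv_bn_relu(channels, channels), conv_bn_relu(channels, channels))

    def forward(self, x):
        return x + self.body(x)

model = nn.Sequential(
    conv_bn_relu(3, 64, stride=2),
    ModernBlock(64),
    ModernBlock(64),
    # Principle 4: global average pooling instead of large fully connected layers.
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 10),
)
print(model(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 10])
```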
Architecture Comparison
When to Use Which Architecture
| Architecture | Best For | Strengths | Weaknesses |
|---|---|---|---|
| VGG | Teaching, feature extraction | Simple, uniform structure | Very large, slow |
| ResNet | General purpose, transfer learning | Robust, well-studied | Large memory footprint |
| DenseNet | Limited data, feature reuse | Parameter efficient | Memory intensive |
| EfficientNet | Best accuracy/compute | State-of-the-art efficiency | Complex to implement |
| MobileNet | Mobile/edge deployment | Extremely fast | Lower accuracy |
Practical Tips for CNN Design
Exercises
Exercise 1: Implement Custom Pooling
Implement these pooling variants and compare them on CIFAR-10:
- Mixed Pooling: $y = \lambda \cdot \max(R) + (1 - \lambda) \cdot \operatorname{avg}(R)$ with mixing coefficient $\lambda \in [0, 1]$ (fixed, random, or learned)
- Stochastic Pooling: Sample based on activation magnitudes
- LP Pooling: $y = \big(\tfrac{1}{n} \sum_{i=1}^{n} x_i^p\big)^{1/p}$ over the $n$ values in each window ($p = 1$ recovers average pooling; $p \to \infty$ approaches max pooling)
Exercise 2: Build ResNet Variants
Implement and compare these ResNet variations:
- Pre-activation ResNet: BN-ReLU-Conv instead of Conv-BN-ReLU
- Wide ResNet: Increase width instead of depth
- ResNeXt: Grouped convolutions in bottleneck
Exercise 3: Analyze Receptive Fields
Write a function to compute the receptive field of any CNN, then use it to analyze VGG-16, ResNet-50, and EfficientNet-B0.
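One possible starting point uses the standard receptive-field recurrence for a chain of conv/pool layers (the function and its `(kernel, stride)` layer description are a hypothetical scaffold; handling branching architectures like ResNet is part of the exercise):

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples in forward order.
    RF = 1 + sum over layers of (k - 1) * product of strides of all earlier layers."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Two VGG-style 3x3 convs (stride 1) followed by a 2x2 max pool (stride 2):
print(receptive_field([(3, 1), (3, 1), (2, 2)]))  # 6
```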
Exercise 4: Architecture Search
Implement a simple neural architecture search:
- Define a search space (number of layers, channels, kernel sizes)
- Use random search to sample architectures
- Train each for 10 epochs on CIFAR-10
- Find the Pareto frontier of accuracy vs parameters
Exercise 5: Transfer Learning Deep Dive
Using a pretrained ResNet-50:
- Fine-tune only the last layer on a new dataset
- Fine-tune all layers with different learning rates
- Extract intermediate features and train a separate classifier
- Compare accuracy, training time, and generalization
Key Takeaways
| Concept | Key Insight |
|---|---|
| Max Pooling | Detects presence of features, provides translation invariance |
| Strided Convolution | Learned downsampling, modern replacement for pooling |
| GAP | Eliminates FC parameters, improves generalization |
| VGGNet | Small filters + depth = large receptive field |
| ResNet | Skip connections enable training very deep networks |
| EfficientNet | Scale depth, width, and resolution together |
| Modern Design | BN everywhere, skip connections, attention, GAP head |