Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Convolutional Neural Networks
Why Images Need Special Treatment
A 224×224 RGB image has 224 × 224 × 3 = 150,528 pixels. If we connected this to a fully connected layer with 1000 neurons:- 150,528 × 1000 = 150 million parameters in ONE layer!
- Ignores spatial structure (neighboring pixels are related)
- Overfits easily
- Computationally expensive
- Local connectivity: Each neuron only sees a small patch (a cat’s ear doesn’t depend on what’s in the bottom-right corner)
- Weight sharing: Same filter applied across entire image (an edge detector that works in the top-left should work anywhere)
- Translation invariance: Cat is a cat, regardless of position (the learned patterns are position-independent)
The Convolution Operation
Intuition: A Sliding Window Detector
Imagine sliding a small “template” (filter/kernel) across an image, computing similarity at each position:Mathematical Definition
For a 2D convolution:Common Filter Types
Edge Detection
Blur and Sharpen
CNN Building Blocks
Convolutional Layer
- parameters per filter
- total parameters
Pooling Layers
Pooling reduces spatial dimensions while keeping important features. Think of it as summarizing: instead of reporting every detail, you report the highlights.Building a Complete CNN
LeNet-5 Style Architecture
Training the CNN
Visualizing What CNNs Learn
Filter Visualization
Feature Map Visualization
Key CNN Concepts
Stride and Padding
| Concept | Effect | Formula |
|---|---|---|
| Stride | How many pixels to skip | Output = (Input - Kernel + 2×Padding) / Stride + 1 |
| Padding | Zeros added around input | Keeps spatial dimensions with padding=kernel//2 |
| Valid | No padding | Output shrinks |
| Same | Pad to maintain size | Output = Input (when stride=1) |
Receptive Field
The receptive field is the region of input that affects a particular output neuron. This is one of the most important concepts in CNN design. Think of it as “how much of the original image can this neuron see?” A neuron in the first conv layer sees a 3x3 patch. A neuron two layers deep effectively sees a 5x5 patch (because it combines outputs from overlapping 3x3 patches). A neuron at the end of a deep CNN might “see” the entire image. If your receptive field is too small for the patterns you need to detect (say, you need to recognize a full face but your receptive field only covers an eye), the network will struggle — no single neuron can integrate enough context.Classic CNN Architectures
| Architecture | Year | Key Innovation | Depth |
|---|---|---|---|
| LeNet-5 | 1998 | First successful CNN | 5 |
| AlexNet | 2012 | ReLU, Dropout, GPU | 8 |
| VGGNet | 2014 | Small 3×3 filters | 16-19 |
| GoogLeNet | 2014 | Inception modules | 22 |
| ResNet | 2015 | Skip connections | 50-152 |
| DenseNet | 2017 | Dense connections | 121-264 |
| EfficientNet | 2019 | Compound scaling | variable |
VGG-16 Implementation
Exercises
Exercise 1: Custom Filters
Exercise 1: Custom Filters
Implement and apply these classic filters:
- Gaussian blur (5×5)
- Laplacian edge detector
- Custom “cross” pattern detector
Exercise 2: Output Size Calculator
Exercise 2: Output Size Calculator
Write a function that computes output dimensions for any sequence of conv and pool layers:
Exercise 3: CIFAR-10 CNN
Exercise 3: CIFAR-10 CNN
Build a CNN for CIFAR-10 (32×32 color images, 10 classes):
- Design architecture to achieve >85% accuracy
- Use batch normalization and dropout
- Visualize learned filters and feature maps
- Analyze which classes are confused
Exercise 4: Depthwise Separable Convolutions
Exercise 4: Depthwise Separable Convolutions
Implement depthwise separable convolutions (used in MobileNet):
- Depthwise: one filter per input channel
- Pointwise: 1×1 convolution to mix channels
Key Takeaways
| Concept | Key Insight |
|---|---|
| Convolution | Local patterns, shared weights |
| Pooling | Downsample, add invariance |
| Stride | Skip pixels, reduce dimensions |
| Padding | Control output size |
| Feature hierarchy | Edges → Shapes → Parts → Objects |
| Receptive field | What input region affects output |
What’s Next
Module 7: Pooling, Stride & CNN Design
Build modern CNN architectures — VGG, ResNet, EfficientNet design principles.
Interview Deep-Dive
Explain why CNNs use weight sharing and local connectivity. What inductive biases do these encode, and when do they fail?
Explain why CNNs use weight sharing and local connectivity. What inductive biases do these encode, and when do they fail?
Strong Answer:
- Weight sharing means the same convolutional filter is applied at every spatial position. This encodes the inductive bias of translation equivariance: a feature detector that recognizes a cat ear in the top-left should recognize it in the bottom-right. Without weight sharing, the network would need to independently learn the same pattern for every possible location, requiring far more parameters and data.
- Local connectivity means each neuron only connects to a small spatial patch (the receptive field), not the entire image. This encodes the locality bias: nearby pixels are more related than distant ones. An edge at position depends on pixels at through , not on a pixel at .
- Together, these biases reduce the parameter count by orders of magnitude: a 3x3 convolution with 32 filters on a 224x224 image uses 896 parameters instead of 150 million for a fully connected layer. This massive reduction acts as strong regularization, preventing overfitting.
- When they fail: (1) Tasks requiring global context, like counting objects across the entire image — local receptive fields miss long-range dependencies. (2) Data where the same pattern at different locations has different meanings — medical images where a lesion on the left lung has different significance than on the right. (3) Non-grid-structured data — graphs, point clouds, or sets of variable-length items. This is precisely why Vision Transformers, which have global attention without locality bias, can outperform CNNs when sufficient data is available to learn spatial relationships from scratch.
What is the receptive field of a CNN, and why is it critical for architecture design?
What is the receptive field of a CNN, and why is it critical for architecture design?
Strong Answer:
- The receptive field is the region of the input image that can influence a particular neuron’s output. A neuron in the first 3x3 conv layer has a 3x3 receptive field. After stacking two 3x3 conv layers, the effective receptive field is 5x5. After three, it is 7x7.
- Why it is critical: the receptive field determines what scale of features the network can detect. To recognize a face (roughly 100x100 pixels in a typical image), the final convolutional features must have a receptive field of at least 100x100. If the receptive field is only 50x50, no single neuron can “see” the entire face, and the network must rely on the fully connected layers to integrate partial information.
- Computing receptive field: for a stack of layers with kernel size and stride , the receptive field grows as . Pooling layers with stride 2 are powerful receptive field amplifiers because every subsequent layer’s kernel covers twice as much input space.
- VGG’s insight: two 3x3 convolutions have the same receptive field as one 5x5, but with fewer parameters ( vs. ) and more non-linearity (two ReLU activations instead of one). Three 3x3 convolutions match a 7x7 filter ( vs. ) with even more savings. This is why modern CNNs exclusively use 3x3 kernels stacked deeply.
- The effective receptive field (what the neuron actually “attends to”) is typically much smaller than the theoretical receptive field because weights near the center have more influence than those at the edges. This follows a Gaussian distribution, not a uniform one.
Calculate the output dimensions and parameter count for a Conv2d layer given specific input dimensions. Walk through the math.
Calculate the output dimensions and parameter count for a Conv2d layer given specific input dimensions. Walk through the math.
Strong Answer:
- Output spatial dimensions: , where is input height, is padding, is kernel size, is stride. Same formula for width.
- Concrete example: input , Conv2d(3, 64, kernel_size=7, stride=2, padding=3):
- Output shape:
- Parameter count: (the +1 is for the bias per filter).
- parameters.
- Without bias: . Many modern architectures use
bias=Falsewhen followed by batch normalization, since BN’s learned shift parameter () makes the bias redundant.
- Compute cost (FLOPs): roughly (multiply-accumulate operations). For our example: FLOPs. This single layer accounts for a significant fraction of a ResNet’s total compute because of the large spatial dimensions.