Convolutional Neural Networks
Why Images Need Special Treatment
A 224×224 RGB image has 224 × 224 × 3 = 150,528 input values. Connecting this directly to a fully connected layer with 1000 neurons would require 150,528 × 1000 ≈ 150 million parameters in ONE layer! This approach also:
- Ignores spatial structure (neighboring pixels are related)
- Overfits easily
- Computationally expensive
CNNs address these problems with three key ideas:
- Local connectivity: Each neuron only sees a small patch of the input
- Weight sharing: Same filter applied across entire image
- Translation invariance: Cat is a cat, regardless of position
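To make the savings concrete, here is a quick sketch comparing the fully connected layer above with a convolutional layer; the choice of 32 filters of size 3×3 is an illustrative assumption, not a prescribed design:

```python
# Fully connected: every input value connects to every neuron (weights only).
fc_params = 224 * 224 * 3 * 1000          # 150,528,000

# Convolutional: 32 filters, each 3x3 over 3 input channels, plus one bias each.
# The same 896 parameters are reused at every spatial position.
conv_params = 32 * (3 * 3 * 3 + 1)        # 896

print(fc_params, conv_params)
```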
The Convolution Operation
Intuition: A Sliding Window Detector
Imagine sliding a small “template” (filter/kernel) across an image, computing a similarity score at each position.

Mathematical Definition
For a 2D convolution of an image $I$ with a kernel $K$ (deep learning libraries actually compute cross-correlation, i.e. without flipping the kernel):

$$(I * K)(i, j) = \sum_{m} \sum_{n} I(i + m,\, j + n)\, K(m, n)$$

Common Filter Types
Edge Detection
Blur and Sharpen
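These classic filters can be sketched with a naive NumPy convolution (cross-correlation, as in deep learning libraries); the kernels below are the standard Sobel-x and 3×3 box-blur kernels:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation of a 2D image with a 2D kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)   # responds to vertical edges
box_blur = np.ones((3, 3)) / 9.0                # simple averaging blur

# A vertical step edge: left half dark, right half bright.
img = np.zeros((5, 5))
img[:, 3:] = 1.0
edges = conv2d(img, sobel_x)
print(edges)  # strong response near the edge, zero in the flat region
```

Real code would use `scipy.signal.correlate2d` or a framework's conv layer; the explicit loops here are only to mirror the formula above.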
CNN Building Blocks
Convolutional Layer
- Parameters per filter: `kernel_height × kernel_width × input_channels + 1` (the +1 is the bias)
- Total parameters: `num_filters × (kernel_height × kernel_width × input_channels + 1)`
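As a sanity check, the counts for a hypothetical layer with 64 filters of size 3×3 over a 3-channel input:

```python
in_channels, num_filters, k = 3, 64, 3

params_per_filter = k * k * in_channels + 1      # 3*3*3 + 1 bias = 28
total_params = num_filters * params_per_filter   # 64 * 28 = 1792

print(params_per_filter, total_params)
```

1792 is exactly the parameter count of the first convolutional layer in VGG-16.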
Pooling Layers
Pooling reduces spatial dimensions while keeping important features.

Building a Complete CNN
LeNet-5 Style Architecture
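The original implementation is not shown here; as a framework-free stand-in, here is a plain-Python walkthrough of a LeNet-5-style stack (conv 6@5×5 → pool → conv 16@5×5 → pool → FC head) tracing shapes on a 32×32 grayscale input:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a conv or pool layer."""
    return (size - kernel + 2 * pad) // stride + 1

size, channels = 32, 1
size = conv_out(size, 5)             # conv1: 6 filters, 5x5 -> 28x28
channels = 6
size = conv_out(size, 2, stride=2)   # pool1: 2x2, stride 2 -> 14x14
size = conv_out(size, 5)             # conv2: 16 filters, 5x5 -> 10x10
channels = 16
size = conv_out(size, 2, stride=2)   # pool2: 2x2, stride 2 -> 5x5

flat = channels * size * size        # 16 * 5 * 5 = 400 inputs to the FC head
print(size, flat)
```

The 400-unit flattened vector then feeds the classic 120 → 84 → 10 fully connected head.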
Training the CNN
Visualizing What CNNs Learn
Filter Visualization
Feature Map Visualization
Key CNN Concepts
Stride and Padding
| Concept | Effect | Formula |
|---|---|---|
| Stride | How many pixels to skip | Output = (Input - Kernel + 2×Padding) / Stride + 1 |
| Padding | Zeros added around input | Keeps spatial dimensions with padding=kernel//2 |
| Valid | No padding | Output shrinks |
| Same | Pad to maintain size | Output = Input (when stride=1) |
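The formula in the table can be checked directly; for instance, a 3×3 kernel with stride 1 and padding 1 preserves a 32×32 input:

```python
def output_size(input_size, kernel, stride=1, padding=0):
    """Output = (Input - Kernel + 2*Padding) // Stride + 1."""
    return (input_size - kernel + 2 * padding) // stride + 1

print(output_size(32, 3, stride=1, padding=1))  # 32 ("same" padding)
print(output_size(32, 3, stride=2, padding=1))  # 16 (stride halves the size)
print(output_size(32, 5, stride=1, padding=0))  # 28 ("valid": output shrinks)
```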
Receptive Field
The receptive field is the region of the input that affects a particular output neuron. Stacking layers grows it, which is how deep layers come to see large image regions through small filters.

Classic CNN Architectures
| Architecture | Year | Key Innovation | Depth |
|---|---|---|---|
| LeNet-5 | 1998 | First successful CNN | 5 |
| AlexNet | 2012 | ReLU, Dropout, GPU | 8 |
| VGGNet | 2014 | Small 3×3 filters | 16-19 |
| GoogLeNet | 2014 | Inception modules | 22 |
| ResNet | 2015 | Skip connections | 50-152 |
| DenseNet | 2017 | Dense connections | 121-264 |
| EfficientNet | 2019 | Compound scaling | variable |
VGG-16 Implementation
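The full implementation is not included here; as a framework-free sketch, the standard VGG-16 configuration (numbers are output channels, 'M' marks a 2×2 max pool) can be walked through to verify the feature-map shapes and the well-known ~138M parameter count:

```python
# VGG-16 convolutional configuration (all convs are 3x3, padding 1).
cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
       512, 512, 512, 'M', 512, 512, 512, 'M']

size, in_ch, params = 224, 3, 0
for layer in cfg:
    if layer == 'M':
        size //= 2                              # 2x2 max pool, stride 2
    else:
        params += layer * (3 * 3 * in_ch + 1)   # weights + biases; pad 1 keeps size
        in_ch = layer

# Fully connected head: 512*7*7 -> 4096 -> 4096 -> 1000.
flat = in_ch * size * size
for out in (4096, 4096, 1000):
    params += flat * out + out
    flat = out

print(size, params)  # 7 138357544 (~138M parameters, mostly in the FC head)
```

Note that roughly 124M of the 138M parameters sit in the fully connected layers, which is why later architectures replaced them with global average pooling.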
Exercises
Exercise 1: Custom Filters
Implement and apply these classic filters:
- Gaussian blur (5×5)
- Laplacian edge detector
- Custom “cross” pattern detector
Exercise 2: Output Size Calculator
Write a function that computes output dimensions for any sequence of conv and pool layers:
Exercise 3: CIFAR-10 CNN
Build a CNN for CIFAR-10 (32×32 color images, 10 classes):
- Design architecture to achieve >85% accuracy
- Use batch normalization and dropout
- Visualize learned filters and feature maps
- Analyze which classes are confused
Exercise 4: Depthwise Separable Convolutions
Implement depthwise separable convolutions (used in MobileNet):
- Depthwise: one filter per input channel
- Pointwise: 1×1 convolution to mix channels
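As a starting point, compare the parameter counts of a standard convolution against the depthwise + pointwise pair (the layer sizes below are illustrative assumptions; biases omitted):

```python
in_ch, out_ch, k = 32, 64, 3

standard = out_ch * (k * k * in_ch)   # one k x k x in_ch filter per output channel
depthwise = in_ch * (k * k)           # one k x k filter per input channel
pointwise = out_ch * in_ch            # 1x1 conv mixing channels
separable = depthwise + pointwise     # 288 + 2048 = 2336

print(standard, separable, standard / separable)  # ~7.9x fewer parameters
```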
Key Takeaways
| Concept | Key Insight |
|---|---|
| Convolution | Local patterns, shared weights |
| Pooling | Downsample, add invariance |
| Stride | Skip pixels, reduce dimensions |
| Padding | Control output size |
| Feature hierarchy | Edges → Shapes → Parts → Objects |
| Receptive field | What input region affects output |