Convolutional Neural Networks

Why Images Need Special Treatment

A 224×224 RGB image has 224 × 224 × 3 = 150,528 input values (50,176 pixels × 3 color channels). If we connected this to a fully connected layer with 1000 neurons:

150,528 × 1000 ≈ 150 million weights in ONE layer!
Ignores spatial structure (neighboring pixels are related)
Overfits easily
Computationally expensive

CNNs solve this by:

Local connectivity: Each neuron only sees a small patch
Weight sharing: Same filter applied across entire image
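Translation invariance: A cat is a cat, regardless of position (a convolution's output shifts along with its input, and pooling then makes the response approximately position-independent)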
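
Weight sharing and the behavior under translation can be checked numerically. The following is a minimal NumPy sketch, not a library implementation: the helper `correlate_valid`, the toy 8×8 image, and the 2×2 filter values are all illustrative assumptions. It moves a single bright pixel three columns to the right and checks that the feature map moves by the same amount.

```python
import numpy as np

def correlate_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    kH, kW = kernel.shape
    out_H = image.shape[0] - kH + 1
    out_W = image.shape[1] - kW + 1
    out = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

# Toy input: one bright pixel on a dark 8x8 background
image = np.zeros((8, 8))
image[2, 2] = 1.0

# Same image with the pixel moved 3 columns to the right
shifted = np.roll(image, shift=3, axis=1)

# Arbitrary illustrative filter: 4 shared weights, whatever the image size
kernel = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

out = correlate_valid(image, kernel)
out_shifted = correlate_valid(shifted, kernel)

# Away from the borders, shifting the input shifts the feature map
shift_matches = np.array_equal(out_shifted[:, 3:], out[:, :-3])
print("Feature map shifted with the input:", shift_matches)
```

The same four weights produce the response at every position, which is why the parameter count does not grow with image size.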

The Convolution Operation

Intuition: A Sliding Window Detector

Imagine sliding a small “template” (filter/kernel) across an image, computing similarity at each position:

Image:                    Filter:           Output:
[1 2 3 4 5]              [1 0 -1]          [? ? ?]
[2 3 4 5 6]                                
[3 4 5 6 7]        →          →            Feature Map
[4 5 6 7 8]
[5 6 7 8 9]

At each position, we compute:

\sum_{i,j} \text{Image}_{i,j} \cdot \text{Filter}_{i,j}
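
As a worked example, the sum above can be evaluated for the 5×5 image and the 1×3 filter from the diagram. This is a small NumPy sketch under the assumption that the filter is slid with stride 1 and no padding; the variable names are illustrative.

```python
import numpy as np

image = np.array([
    [1, 2, 3, 4, 5],
    [2, 3, 4, 5, 6],
    [3, 4, 5, 6, 7],
    [4, 5, 6, 7, 8],
    [5, 6, 7, 8, 9],
])
filt = np.array([[1, 0, -1]])  # 1x3 filter from the diagram

kH, kW = filt.shape
out_H = image.shape[0] - kH + 1  # 5 vertical positions
out_W = image.shape[1] - kW + 1  # 3 horizontal positions

feature_map = np.zeros((out_H, out_W), dtype=int)
for i in range(out_H):
    for j in range(out_W):
        patch = image[i:i + kH, j:j + kW]
        # Elementwise product, then sum: left value minus right value
        feature_map[i, j] = np.sum(patch * filt)

print(feature_map)
```

Every entry comes out as -2: the filter subtracts the value two columns to the right from the current one, and the intensity in this image rises by 1 per column.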

Mathematical Definition

For a 2D convolution as used in CNNs (strictly speaking a cross-correlation, because the kernel is not flipped):

(I * K)[i, j] = \sum_{m}\sum_{n} I[i+m, j+n] \cdot K[m, n]

import numpy as np

def conv2d_naive(image, kernel, stride=1, padding=0):
    """
    Simple 2D convolution implementation.
    
    Args:
        image: Input image (H, W)
        kernel: Convolution kernel (kH, kW)
        stride: Step size
        padding: Zero padding around image
    """
    # Add padding
    if padding > 0:
        image = np.pad(image, padding, mode='constant')
    
    H, W = image.shape
    kH, kW = kernel.shape
    
    # Output dimensions
    out_H = (H - kH) // stride + 1
    out_W = (W - kW) // stride + 1
    
    output = np.zeros((out_H, out_W))
    
    for i in range(out_H):
        for j in range(out_W):
            # Extract patch
            patch = image[i*stride:i*stride+kH, j*stride:j*stride+kW]
            # Compute dot product
            output[i, j] = np.sum(patch * kernel)
    
    return output


# Example: Edge detection
image = np.array([
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
])

# Vertical edge detector
edge_kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1]
])

result = conv2d_naive(image, edge_kernel)
print("Input shape:", image.shape)
print("Output shape:", result.shape)
print("Output:\n", result)
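
The definition above indexes the image as I[i+m, j+n], which signal processing texts call cross-correlation; a textbook convolution flips the kernel on both axes first. The sketch below illustrates the difference on a toy 7×7 image like the one above. The helper names `cross_correlate` and `true_convolve` are illustrative, not library functions.

```python
import numpy as np

def cross_correlate(image, kernel):
    """Sliding dot product without flipping the kernel (stride 1, no padding)."""
    kH, kW = kernel.shape
    out = np.zeros((image.shape[0] - kH + 1, image.shape[1] - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

def true_convolve(image, kernel):
    """Textbook convolution: cross-correlation with the kernel flipped on both axes."""
    return cross_correlate(image, kernel[::-1, ::-1])

# 3x3 bright square centered in a 7x7 dark image
image = np.zeros((7, 7))
image[2:5, 2:5] = 1

edge_kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
])

corr = cross_correlate(image, edge_kernel)
conv = true_convolve(image, edge_kernel)

print("Cross-correlation, middle row:", corr[2])
print("Convolution, middle row:      ", conv[2])
```

For this antisymmetric kernel the flip only negates the response. Because a CNN learns its filter values, either convention can represent the same detectors, which is why the distinction rarely matters in practice.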

Common Filter Types

Edge Detection

import matplotlib.pyplot as plt
from scipy import ndimage

# Sobel filters for edges
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

sobel_y = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]])

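# Load sample image (data.camera() is already grayscale, so rgb2gray is not needed;
# recent scikit-image versions reject 2-D input to rgb2gray)
from skimage import data, img_as_float

image = img_as_float(data.camera())

# Apply filters (ndimage.convolve flips the kernel, which only changes the sign
# of the Sobel response; the plots below use absolute values and the magnitude)
edges_x = ndimage.convolve(image, sobel_x)
edges_y = ndimage.convolve(image, sobel_y)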
edges_magnitude = np.sqrt(edges_x**2 + edges_y**2)

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
axes[0].imshow(image, cmap='gray')
axes[0].set_title('Original')
axes[1].imshow(np.abs(edges_x), cmap='gray')
axes[1].set_title('Vertical Edges')
axes[2].imshow(np.abs(edges_y), cmap='gray')
axes[2].set_title('Horizontal Edges')
axes[3].imshow(edges_magnitude, cmap='gray')
axes[3].set_title('Edge Magnitude')
plt.show()

Blur and Sharpen

# Gaussian blur (smoothing)
gaussian = np.array([[1, 2, 1],
                     [2, 4, 2],
                     [1, 2, 1]]) / 16

# Sharpening
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])

# Emboss
emboss = np.array([[-2, -1, 0],
                   [-1,  1, 1],
                   [ 0,  1, 2]])
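
To see what the blur and sharpen kernels do, they can be applied to a toy step edge. This is a minimal NumPy sketch; the helper `filter2d_valid` and the 5×6 test image are assumptions made for illustration.

```python
import numpy as np

def filter2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    kH, kW = kernel.shape
    out = np.zeros((image.shape[0] - kH + 1, image.shape[1] - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

gaussian = np.array([[1, 2, 1],
                     [2, 4, 2],
                     [1, 2, 1]]) / 16
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])

# Toy image: dark left half, bright right half (a vertical step edge)
step = np.tile(np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0]), (5, 1))

blurred = filter2d_valid(step, gaussian)
sharpened = filter2d_valid(step, sharpen)

print("Blurred row:  ", blurred[0])    # the step is spread over several pixels
print("Sharpened row:", sharpened[0])  # the step is exaggerated on both sides
```

Both kernels sum to 1, so flat regions keep their brightness; only the neighborhood of the edge changes.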

CNN Building Blocks

Convolutional Layer

import torch
import torch.nn as nn

# Convolutional layer
conv = nn.Conv2d(
    in_channels=3,      # RGB input
    out_channels=32,    # 32 filters/kernels
    kernel_size=3,      # 3×3 filters
    stride=1,           # Move 1 pixel at a time
    padding=1           # Preserve spatial dimensions
)

# Input: (batch, channels, height, width)
x = torch.randn(1, 3, 224, 224)
output = conv(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Number of parameters: {sum(p.numel() for p in conv.parameters()):,}")

Parameter count:

(k_H \times k_W \times C_{in} + 1) \times C_{out}

$3 \times 3 \times 3 + 1 = 28$ parameters per filter
$28 \times 32 = 896$ total parameters

Compare to a fully connected layer that maps the same input to just 32 output neurons:

224 \times 224 \times 3 \times 32 = 4{,}816{,}896 \approx 4.8 \text{ million weights}
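
The parameter arithmetic above can be wrapped in two small helpers. This is a plain-Python sketch; the function names `conv2d_params` and `linear_params` are illustrative and not part of PyTorch.

```python
def conv2d_params(k_h, k_w, c_in, c_out, bias=True):
    """Weights (plus optional bias) of a conv layer: (k_h * k_w * c_in + 1) * c_out."""
    return (k_h * k_w * c_in + (1 if bias else 0)) * c_out

def linear_params(n_in, n_out, bias=True):
    """Weights (plus optional bias) of a fully connected layer."""
    return (n_in + (1 if bias else 0)) * n_out

conv_count = conv2d_params(3, 3, 3, 32)
fc_count = linear_params(224 * 224 * 3, 32, bias=False)

print(f"Conv 3x3, 3 -> 32 channels: {conv_count:,}")
print(f"Fully connected, 32 units:  {fc_count:,}")
print(f"Ratio: {fc_count // conv_count}x")
```

Note that the convolutional count does not depend on the 224×224 input size at all, while the fully connected count grows with every added pixel.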

Pooling Layers

Pooling reduces spatial dimensions while keeping important features:

# Max pooling - takes maximum in each window
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)

# Average pooling - takes average in each window
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

# Global average pooling - reduces to single value per channel
global_avg_pool = nn.AdaptiveAvgPool2d(1)

x = torch.randn(1, 32, 224, 224)

print(f"Input: {x.shape}")
print(f"After 2×2 max pool: {max_pool(x).shape}")
print(f"After global avg pool: {global_avg_pool(x).shape}")
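
What pooling computes is easiest to see on a tiny array. The sketch below reimplements 2×2 pooling with stride 2 in NumPy for a single 4×4 channel; the helper names and the input values are illustrative assumptions, and the reshape trick requires even height and width.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling, stride 2, on an (H, W) array with even H and W."""
    H, W = x.shape
    # Split into non-overlapping 2x2 blocks, then reduce each block
    blocks = x.reshape(H // 2, 2, W // 2, 2)
    return blocks.max(axis=(1, 3))

def avg_pool_2x2(x):
    """2x2 average pooling, stride 2, on an (H, W) array with even H and W."""
    H, W = x.shape
    blocks = x.reshape(H // 2, 2, W // 2, 2)
    return blocks.mean(axis=(1, 3))

x = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
])

max_pooled = max_pool_2x2(x)
avg_pooled = avg_pool_2x2(x)
global_avg = x.mean()  # global average pooling: one value per channel

print("Max pool:\n", max_pooled)
print("Avg pool:\n", avg_pooled)
print("Global avg:", global_avg)
```

Each output value summarizes one 2×2 block, so the height and width are halved while the strongest (or average) activation in each block is kept.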

Building a Complete CNN

LeNet-5 Style Architecture

class SimpleCNN(nn.Module):
    """Simple CNN for MNIST digit classification."""
    
    def __init__(self):
        super().__init__()
        
        # Convolutional layers
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        
        # Pooling
        self.pool = nn.MaxPool2d(2, 2)
        
        # Fully connected layers
        self.fc1 = nn.Linear(64 * 3 * 3, 64)
        self.fc2 = nn.Linear(64, 10)
        
        # Dropout for regularization
        self.dropout = nn.Dropout(0.5)
    
    def forward(self, x):
        # Conv block 1: 28×28 → 14×14
        x = self.pool(torch.relu(self.conv1(x)))
        
        # Conv block 2: 14×14 → 7×7
        x = self.pool(torch.relu(self.conv2(x)))
        
        # Conv block 3: 7×7 → 3×3
        x = self.pool(torch.relu(self.conv3(x)))
        
        # Flatten
        x = x.view(x.size(0), -1)
        
        # Fully connected
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        
        return x


model = SimpleCNN()
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")

Training the CNN

import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Data loading
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_data = datasets.MNIST('./data', train=False, transform=transform)

train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=1000)

# Training setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    
    for data, target in loader:
        data, target = data.to(device), target.to(device)
        
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        pred = output.argmax(dim=1)
        correct += pred.eq(target).sum().item()
    
    return total_loss / len(loader), 100. * correct / len(loader.dataset)

# Train for a few epochs
for epoch in range(5):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    print(f"Epoch {epoch+1}: Loss = {train_loss:.4f}, Accuracy = {train_acc:.2f}%")
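
The loop above reports training accuracy only. A held-out evaluation could be sketched as follows; `evaluate` is a hypothetical helper that mirrors `train_epoch`, and the toy model and loader at the end are stand-ins so the snippet runs on its own.

```python
import torch
import torch.nn as nn

def evaluate(model, loader, criterion, device):
    """Return (mean loss per batch, accuracy in %) without updating weights."""
    model.eval()  # switch dropout to inference behavior
    total_loss, correct, seen = 0.0, 0, 0

    with torch.no_grad():  # no gradients needed for evaluation
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            total_loss += criterion(output, target).item()
            correct += output.argmax(dim=1).eq(target).sum().item()
            seen += target.size(0)

    return total_loss / len(loader), 100.0 * correct / seen

# Tiny synthetic check (stand-in for the MNIST model and test loader)
toy_model = nn.Linear(4, 3)
toy_loader = [(torch.randn(8, 4), torch.randint(0, 3, (8,))) for _ in range(2)]
toy_loss, toy_acc = evaluate(toy_model, toy_loader, nn.CrossEntropyLoss(),
                             torch.device('cpu'))
print(f"Toy loss = {toy_loss:.4f}, toy accuracy = {toy_acc:.2f}%")
```

With the objects defined earlier, the call would look like `evaluate(model, test_loader, criterion, device)`. Remember to call `model.train()` again (as `train_epoch` does) before further training.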

Visualizing What CNNs Learn

Filter Visualization

def visualize_filters(model):
    """Visualize the learned filters in the first conv layer."""
    # Get first conv layer weights
    weights = model.conv1.weight.data.cpu()
    
    # Normalize for visualization
    weights = (weights - weights.min()) / (weights.max() - weights.min())
    
    # Plot
    fig, axes = plt.subplots(4, 8, figsize=(12, 6))
    for i, ax in enumerate(axes.flat):
        if i < weights.shape[0]:
            ax.imshow(weights[i, 0], cmap='gray')
        ax.axis('off')
    
    plt.suptitle('Learned Filters (First Conv Layer)')
    plt.show()

visualize_filters(model)

Feature Map Visualization

def visualize_feature_maps(model, image):
    """Visualize feature maps at different layers."""
    activations = []
    
    def hook_fn(module, input, output):
        activations.append(output.detach())
    
    # Register hooks
    hooks = [
        model.conv1.register_forward_hook(hook_fn),
        model.conv2.register_forward_hook(hook_fn),
        model.conv3.register_forward_hook(hook_fn),
    ]
    
    # Forward pass
    with torch.no_grad():
        model(image.unsqueeze(0))
    
    # Remove hooks
    for hook in hooks:
        hook.remove()
    
    # Plot feature maps
    fig, axes = plt.subplots(3, 8, figsize=(16, 6))
    
    for layer_idx, acts in enumerate(activations):
        for i in range(8):
            ax = axes[layer_idx, i]
            ax.imshow(acts[0, i].cpu(), cmap='viridis')
            # Hide ticks but keep the axis on, so the row labels below stay visible
            ax.set_xticks([])
            ax.set_yticks([])
    
    axes[0, 0].set_ylabel('Conv1', rotation=90, fontsize=12)
    axes[1, 0].set_ylabel('Conv2', rotation=90, fontsize=12)
    axes[2, 0].set_ylabel('Conv3', rotation=90, fontsize=12)
    
    plt.suptitle('Feature Maps at Different Layers')
    plt.show()

# Visualize for a sample image
sample_image = test_data[0][0]
visualize_feature_maps(model, sample_image.to(device))

Key CNN Concepts

Stride and Padding

Concept	Effect	Formula
Stride	How many pixels the filter moves per step	Output = floor((Input - Kernel + 2×Padding) / Stride) + 1
Padding	Zeros added around input	Keeps spatial dimensions with `padding=kernel//2` (odd kernel, stride 1)
Valid	No padding	Output shrinks
Same	Pad to maintain size	Output = Input (when stride=1)

# Different stride and padding effects
x = torch.randn(1, 1, 32, 32)

conv_valid = nn.Conv2d(1, 1, kernel_size=5, padding=0)  # 32→28
conv_same = nn.Conv2d(1, 1, kernel_size=5, padding=2)   # 32→32
conv_stride = nn.Conv2d(1, 1, kernel_size=5, padding=2, stride=2)  # 32→16

print(f"Valid: {x.shape} → {conv_valid(x).shape}")
print(f"Same:  {x.shape} → {conv_same(x).shape}")
print(f"Stride 2: {x.shape} → {conv_stride(x).shape}")
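
The three sizes in the comments above follow from the output-size formula for a single layer. A plain-Python sketch (the function name `conv_output_size` is an illustrative assumption):

```python
def conv_output_size(n, kernel, padding=0, stride=1):
    """Output length along one spatial axis: floor((n - k + 2p) / s) + 1."""
    return (n - kernel + 2 * padding) // stride + 1

valid_size = conv_output_size(32, kernel=5, padding=0)
same_size = conv_output_size(32, kernel=5, padding=2)
strided_size = conv_output_size(32, kernel=5, padding=2, stride=2)

print(f"Valid:    32 -> {valid_size}")
print(f"Same:     32 -> {same_size}")
print(f"Stride 2: 32 -> {strided_size}")
```

The integer division matters in the strided case: (32 - 5 + 4) / 2 = 15.5, which is rounded down to 15 before adding 1.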

Receptive Field

The receptive field is the region of input that affects a particular output neuron.

def compute_receptive_field(layers):
    """
    Compute receptive field for a stack of conv layers.
    
    Args:
        layers: List of (kernel_size, stride) tuples
    """
    r = 1  # Start with 1×1 receptive field
    
    # Walk from the output back to the input:
    # r_in = stride * (r_out - 1) + kernel_size
    for kernel_size, stride in reversed(layers):
        r = stride * (r - 1) + kernel_size
    
    return r

# Example: VGG-style stack
layers = [
    (3, 1),  # Conv 3×3, stride 1
    (3, 1),  # Conv 3×3, stride 1
    (2, 2),  # Pool 2×2, stride 2
    (3, 1),  # Conv 3×3, stride 1
    (3, 1),  # Conv 3×3, stride 1
    (2, 2),  # Pool 2×2, stride 2
]

print(f"Receptive field: {compute_receptive_field(layers)}×{compute_receptive_field(layers)}")
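
The receptive field can also be computed front to back, which gives an independent check. This sketch assumes the standard recurrence in which each layer adds (kernel_size - 1) times the current "jump" (the distance in input pixels between neighboring outputs); the function name is illustrative.

```python
def receptive_field_forward(layers):
    """
    Receptive field of a stack of conv/pool layers, computed input to output.

    Args:
        layers: List of (kernel_size, stride) tuples
    """
    r = 1     # receptive field of a single input pixel
    jump = 1  # spacing, in input pixels, between adjacent outputs
    for kernel_size, stride in layers:
        r += (kernel_size - 1) * jump
        jump *= stride
    return r

vgg_style = [
    (3, 1), (3, 1), (2, 2),  # two 3x3 convs, then 2x2 pool
    (3, 1), (3, 1), (2, 2),  # two 3x3 convs, then 2x2 pool
]

rf_two_convs = receptive_field_forward([(3, 1), (3, 1)])
rf_stack = receptive_field_forward(vgg_style)

print(f"Two 3x3 convs: {rf_two_convs}x{rf_two_convs}")
print(f"Full stack:    {rf_stack}x{rf_stack}")
```

Two stacked 3×3 convolutions see a 5×5 region, the same as one 5×5 convolution, and the full six-layer stack sees 16×16 input pixels.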

Classic CNN Architectures

Architecture	Year	Key Innovation	Depth
LeNet-5	1998	First successful CNN	5
AlexNet	2012	ReLU, Dropout, GPU	8
VGGNet	2014	Small 3×3 filters	16-19
GoogLeNet	2014	Inception modules	22
ResNet	2015	Skip connections	50-152
DenseNet	2017	Dense connections	121-264
EfficientNet	2019	Compound scaling	variable

VGG-16 Implementation

class VGG16(nn.Module):
    """Simplified VGG-16 architecture."""
    
    def __init__(self, num_classes=1000):
        super().__init__()
        
        self.features = nn.Sequential(
            # Block 1
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            # Block 2
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            # Block 3
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            # Block 4
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            # Block 5
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
        )
        
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )
    
    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x
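
To see where VGG-16's parameters sit, the layer sizes above can be tallied without building the model. This is a plain-Python sketch assuming 224×224 RGB input and 1000 classes; the `cfg` list is an illustrative encoding of the five blocks ('M' marks a 2×2 max pool).

```python
cfg = [64, 64, 'M',
       128, 128, 'M',
       256, 256, 256, 'M',
       512, 512, 512, 'M',
       512, 512, 512, 'M']

conv_params = 0
c_in = 3      # RGB input
size = 224    # spatial size; 3x3 convs with padding=1 leave it unchanged

for v in cfg:
    if v == 'M':
        size //= 2                             # each pool halves height and width
    else:
        conv_params += (3 * 3 * c_in + 1) * v  # weights + bias per output channel
        c_in = v

flat = c_in * size * size                      # 512 * 7 * 7 = 25,088
fc_shapes = [(flat, 4096), (4096, 4096), (4096, 1000)]
fc_params = sum((n_in + 1) * n_out for n_in, n_out in fc_shapes)

total_params = conv_params + fc_params
print(f"Conv layers: {conv_params:,}")
print(f"FC layers:   {fc_params:,}")
print(f"Total:       {total_params:,}")
```

Roughly 89% of the parameters live in the three fully connected layers, most of them in the first one, even though the 13 convolutional layers do most of the feature extraction.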

Exercises

Exercise 1: Custom Filters

Implement and apply these classic filters:

Gaussian blur (5×5)
Laplacian edge detector
Custom “cross” pattern detector

Apply them to real images and visualize results.

Exercise 2: Output Size Calculator

Write a function that computes output dimensions for any sequence of conv and pool layers:

def compute_output_size(input_size, layers):
    """
    layers: list of dicts with keys:
        'type': 'conv' or 'pool'
        'kernel': kernel size
        'stride': stride
        'padding': padding
    """
    pass

Exercise 3: CIFAR-10 CNN

Build a CNN for CIFAR-10 (32×32 color images, 10 classes):

Design architecture to achieve >85% accuracy
Use batch normalization and dropout
Visualize learned filters and feature maps
Analyze which classes are confused

Exercise 4: Depthwise Separable Convolutions

Implement depthwise separable convolutions (used in MobileNet):

Depthwise: one filter per input channel
Pointwise: 1×1 convolution to mix channels

Compare parameters and speed to standard convolutions.

Key Takeaways

Concept	Key Insight
Convolution	Local patterns, shared weights
Pooling	Downsample, add invariance
Stride	Skip pixels, reduce dimensions
Padding	Control output size
Feature hierarchy	Edges → Shapes → Parts → Objects
Receptive field	What input region affects output

What’s Next

Module 7: Pooling, Stride & CNN Design

Build modern CNN architectures and learn the design principles behind VGG, ResNet, and EfficientNet.
