Convolutional Neural Networks

Why Images Need Special Treatment

A 224×224 RGB image has 224 × 224 × 3 = 150,528 input values. If we connected this directly to a fully connected layer with 1000 neurons:
  • 150,528 × 1000 ≈ 150 million weights in ONE layer! (see the quick check below)
  • Spatial structure is ignored (neighboring pixels are related)
  • It overfits easily
  • It is computationally expensive
CNNs solve this by:
  1. Local connectivity: Each neuron only sees a small patch
  2. Weight sharing: Same filter applied across entire image
  3. Translation invariance: a cat is a cat, regardless of where it appears (convolution itself is translation-equivariant; pooling adds the invariance)
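A quick back-of-the-envelope check of the numbers above (a rough sketch; the 64-filter convolutional layer is an arbitrary illustrative choice):
# Rough parameter counts for the comparison above
h, w, c = 224, 224, 3
fc_weights = h * w * c * 1000            # one dense layer with 1000 neurons (weights only)
conv_params = (3 * 3 * c + 1) * 64       # e.g. 64 3x3 filters, biases included
print(f"Fully connected: {fc_weights:,} weights")      # 150,528,000
print(f"Convolutional:   {conv_params:,} parameters")  # 1,792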
Fully Connected vs Convolutional

The Convolution Operation

Intuition: A Sliding Window Detector

Imagine sliding a small “template” (filter/kernel) across an image, computing similarity at each position:
Image:                    Filter:           Output:
[1 2 3 4 5]               [1 0 -1]          [? ? ?]
[2 3 4 5 6]               [1 0 -1]          [? ? ?]
[3 4 5 6 7]       →       [1 0 -1]    →     [? ? ?]
[4 5 6 7 8]
[5 6 7 8 9]                                 Feature Map
At each position, we compute: $\sum_{i,j} \text{Image}_{i,j} \cdot \text{Filter}_{i,j}$

Mathematical Definition

For a 2D convolution: $(I * K)[i, j] = \sum_{m}\sum_{n} I[i+m, j+n] \cdot K[m, n]$

(Strictly speaking, this indexing is cross-correlation; true convolution flips the kernel. Deep learning libraries compute cross-correlation and call it convolution, which is harmless because the kernel is learned anyway.)
import numpy as np

def conv2d_naive(image, kernel, stride=1, padding=0):
    """
    Simple 2D convolution implementation.
    
    Args:
        image: Input image (H, W)
        kernel: Convolution kernel (kH, kW)
        stride: Step size
        padding: Zero padding around image
    """
    # Add padding
    if padding > 0:
        image = np.pad(image, padding, mode='constant')
    
    H, W = image.shape
    kH, kW = kernel.shape
    
    # Output dimensions
    out_H = (H - kH) // stride + 1
    out_W = (W - kW) // stride + 1
    
    output = np.zeros((out_H, out_W))
    
    for i in range(out_H):
        for j in range(out_W):
            # Extract patch
            patch = image[i*stride:i*stride+kH, j*stride:j*stride+kW]
            # Compute dot product
            output[i, j] = np.sum(patch * kernel)
    
    return output


# Example: Edge detection
image = np.array([
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
])

# Vertical edge detector
edge_kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1]
])

result = conv2d_naive(image, edge_kernel)
print("Input shape:", image.shape)
print("Output shape:", result.shape)
print("Output:\n", result)

Common Filter Types

Edge Detection

import matplotlib.pyplot as plt
from scipy import ndimage

# Sobel filters for edges
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

sobel_y = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]])

# Load sample image
from skimage import data
from skimage.color import rgb2gray

image = rgb2gray(data.camera())

# Apply filters
edges_x = ndimage.convolve(image, sobel_x)
edges_y = ndimage.convolve(image, sobel_y)
edges_magnitude = np.sqrt(edges_x**2 + edges_y**2)

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
axes[0].imshow(image, cmap='gray')
axes[0].set_title('Original')
axes[1].imshow(np.abs(edges_x), cmap='gray')
axes[1].set_title('Vertical Edges')
axes[2].imshow(np.abs(edges_y), cmap='gray')
axes[2].set_title('Horizontal Edges')
axes[3].imshow(edges_magnitude, cmap='gray')
axes[3].set_title('Edge Magnitude')
plt.show()
Edge Detection Filters

Blur and Sharpen

# Gaussian blur (smoothing)
gaussian = np.array([[1, 2, 1],
                     [2, 4, 2],
                     [1, 2, 1]]) / 16

# Sharpening
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])

# Emboss
emboss = np.array([[-2, -1, 0],
                   [-1,  1, 1],
                   [ 0,  1, 2]])
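A short sketch applying these three kernels, assuming the grayscale camera image, ndimage, and matplotlib from the edge-detection example are still in scope:
# Apply the blur, sharpen, and emboss kernels to the same grayscale image
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (name, k) in zip(axes, [('Gaussian blur', gaussian),
                                ('Sharpen', sharpen),
                                ('Emboss', emboss)]):
    ax.imshow(ndimage.convolve(image, k), cmap='gray')
    ax.set_title(name)
    ax.axis('off')
plt.show()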

CNN Building Blocks

Convolutional Layer

import torch
import torch.nn as nn

# Convolutional layer
conv = nn.Conv2d(
    in_channels=3,      # RGB input
    out_channels=32,    # 32 filters/kernels
    kernel_size=3,      # 3×3 filters
    stride=1,           # Move 1 pixel at a time
    padding=1           # Preserve spatial dimensions
)

# Input: (batch, channels, height, width)
x = torch.randn(1, 3, 224, 224)
output = conv(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Number of parameters: {sum(p.numel() for p in conv.parameters()):,}")
Parameter count: $(k_H \times k_W \times C_{in} + 1) \times C_{out}$
  • $3 \times 3 \times 3 + 1 = 28$ parameters per filter
  • $28 \times 32 = 896$ total parameters
Compare to a fully connected layer over the same input: $224 \times 224 \times 3 \times 32 \approx 4.8$ million weights!
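For a concrete contrast, here is a sketch of the fully connected equivalent (the flattened image mapped to the same 32 outputs):
# The fully connected equivalent: flatten 224x224x3 and map to 32 units
fc = nn.Linear(224 * 224 * 3, 32)
print(f"{sum(p.numel() for p in fc.parameters()):,} parameters")  # 4,816,928 (~4.8 million)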

Pooling Layers

Pooling reduces spatial dimensions while keeping important features:
# Max pooling - takes maximum in each window
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)

# Average pooling - takes average in each window
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

# Global average pooling - reduces to single value per channel
global_avg_pool = nn.AdaptiveAvgPool2d(1)

x = torch.randn(1, 32, 224, 224)

print(f"Input: {x.shape}")
print(f"After 2×2 max pool: {max_pool(x).shape}")
print(f"After global avg pool: {global_avg_pool(x).shape}")
Pooling Operations

Building a Complete CNN

LeNet-5 Style Architecture

class SimpleCNN(nn.Module):
    """Simple CNN for MNIST digit classification."""
    
    def __init__(self):
        super().__init__()
        
        # Convolutional layers
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        
        # Pooling
        self.pool = nn.MaxPool2d(2, 2)
        
        # Fully connected layers
        self.fc1 = nn.Linear(64 * 3 * 3, 64)
        self.fc2 = nn.Linear(64, 10)
        
        # Dropout for regularization
        self.dropout = nn.Dropout(0.5)
    
    def forward(self, x):
        # Conv block 1: 28×28 → 14×14
        x = self.pool(torch.relu(self.conv1(x)))
        
        # Conv block 2: 14×14 → 7×7
        x = self.pool(torch.relu(self.conv2(x)))
        
        # Conv block 3: 7×7 → 3×3
        x = self.pool(torch.relu(self.conv3(x)))
        
        # Flatten
        x = x.view(x.size(0), -1)
        
        # Fully connected
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        
        return x


model = SimpleCNN()
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")

Training the CNN

import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Data loading
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_data = datasets.MNIST('./data', train=False, transform=transform)

train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=1000)

# Training setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    
    for data, target in loader:
        data, target = data.to(device), target.to(device)
        
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        pred = output.argmax(dim=1)
        correct += pred.eq(target).sum().item()
    
    return total_loss / len(loader), 100. * correct / len(loader.dataset)

# Train for a few epochs
for epoch in range(5):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    print(f"Epoch {epoch+1}: Loss = {train_loss:.4f}, Accuracy = {train_acc:.2f}%")

Visualizing What CNNs Learn

Filter Visualization

def visualize_filters(model):
    """Visualize the learned filters in the first conv layer."""
    # Get first conv layer weights
    weights = model.conv1.weight.data.cpu()
    
    # Normalize for visualization
    weights = (weights - weights.min()) / (weights.max() - weights.min())
    
    # Plot
    fig, axes = plt.subplots(4, 8, figsize=(12, 6))
    for i, ax in enumerate(axes.flat):
        if i < weights.shape[0]:
            ax.imshow(weights[i, 0], cmap='gray')
        ax.axis('off')
    
    plt.suptitle('Learned Filters (First Conv Layer)')
    plt.show()

visualize_filters(model)

Feature Map Visualization

def visualize_feature_maps(model, image):
    """Visualize feature maps at different layers."""
    activations = []
    
    def hook_fn(module, input, output):
        activations.append(output.detach())
    
    # Register hooks
    hooks = [
        model.conv1.register_forward_hook(hook_fn),
        model.conv2.register_forward_hook(hook_fn),
        model.conv3.register_forward_hook(hook_fn),
    ]
    
    # Forward pass
    with torch.no_grad():
        model(image.unsqueeze(0))
    
    # Remove hooks
    for hook in hooks:
        hook.remove()
    
    # Plot feature maps
    fig, axes = plt.subplots(3, 8, figsize=(16, 6))
    
    for layer_idx, acts in enumerate(activations):
        for i in range(8):
            ax = axes[layer_idx, i]
            ax.imshow(acts[0, i].cpu(), cmap='viridis')
            ax.axis('off')
    
    # Label each row of feature maps (set_ylabel is hidden by axis('off'), so use text)
    for layer_idx, name in enumerate(['Conv1', 'Conv2', 'Conv3']):
        axes[layer_idx, 0].text(-0.3, 0.5, name, transform=axes[layer_idx, 0].transAxes,
                                rotation=90, va='center', fontsize=12)
    
    plt.suptitle('Feature Maps at Different Layers')
    plt.show()

# Visualize for a sample image
sample_image = test_data[0][0]
visualize_feature_maps(model, sample_image.to(device))
Feature Maps Visualization

Key CNN Concepts

Stride and Padding

| Concept | Effect | Formula |
|---------|--------|---------|
| Stride | How many pixels the filter moves per step | Output = (Input − Kernel + 2×Padding) / Stride + 1 |
| Padding | Zeros added around the input | Keeps spatial dimensions when padding = kernel // 2 |
| Valid | No padding | Output shrinks |
| Same | Pad to maintain size | Output = Input (when stride = 1) |
# Different stride and padding effects
x = torch.randn(1, 1, 32, 32)

conv_valid = nn.Conv2d(1, 1, kernel_size=5, padding=0)  # 32→28
conv_same = nn.Conv2d(1, 1, kernel_size=5, padding=2)   # 32→32
conv_stride = nn.Conv2d(1, 1, kernel_size=5, padding=2, stride=2)  # 32→16

print(f"Valid: {x.shape}{conv_valid(x).shape}")
print(f"Same:  {x.shape}{conv_same(x).shape}")
print(f"Stride 2: {x.shape}{conv_stride(x).shape}")

Receptive Field

The receptive field is the region of input that affects a particular output neuron.
def compute_receptive_field(layers):
    """
    Compute receptive field for a stack of conv layers.
    
    Args:
        layers: List of (kernel_size, stride) tuples
    """
    r = 1  # Start with 1×1 receptive field
    
    for kernel_size, stride in reversed(layers):
        # Each output position covers `kernel_size` inputs; consecutive outputs are `stride` apart
        r = stride * (r - 1) + kernel_size
    
    return r

# Example: VGG-style stack
layers = [
    (3, 1),  # Conv 3×3, stride 1
    (3, 1),  # Conv 3×3, stride 1
    (2, 2),  # Pool 2×2, stride 2
    (3, 1),  # Conv 3×3, stride 1
    (3, 1),  # Conv 3×3, stride 1
    (2, 2),  # Pool 2×2, stride 2
]

print(f"Receptive field: {compute_receptive_field(layers)}×{compute_receptive_field(layers)}")

Classic CNN Architectures

| Architecture | Year | Key Innovation | Depth |
|--------------|------|----------------|-------|
| LeNet-5 | 1998 | First successful CNN | 5 |
| AlexNet | 2012 | ReLU, Dropout, GPU training | 8 |
| VGGNet | 2014 | Small 3×3 filters | 16-19 |
| GoogLeNet | 2014 | Inception modules | 22 |
| ResNet | 2015 | Skip connections | 50-152 |
| DenseNet | 2017 | Dense connections | 121-264 |
| EfficientNet | 2019 | Compound scaling | variable |
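Several of these architectures ship with torchvision, which makes it easy to compare their sizes (a quick sketch, assuming torchvision is installed; constructing a model without pretrained weights downloads nothing):
from torchvision import models

# Instantiate a few classic architectures (randomly initialized) and count parameters
for name, ctor in [('VGG-16', models.vgg16),
                   ('ResNet-50', models.resnet50),
                   ('DenseNet-121', models.densenet121)]:
    net = ctor()
    n_params = sum(p.numel() for p in net.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")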

VGG-16 Implementation

class VGG16(nn.Module):
    """Simplified VGG-16 architecture."""
    
    def __init__(self, num_classes=1000):
        super().__init__()
        
        self.features = nn.Sequential(
            # Block 1
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            # Block 2
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            # Block 3
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            # Block 4
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            # Block 5
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
        )
        
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )
    
    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x
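A quick sanity check on a dummy ImageNet-sized input (the model is large, so this is slow on CPU but runs):
vgg = VGG16()
x = torch.randn(1, 3, 224, 224)
print(vgg(x).shape)  # torch.Size([1, 1000])
print(f"{sum(p.numel() for p in vgg.parameters()):,} parameters")  # ~138 million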

Exercises

Exercise 1: Classic filters
Implement and apply these classic filters:
  1. Gaussian blur (5×5)
  2. Laplacian edge detector
  3. Custom “cross” pattern detector
Apply them to real images and visualize the results.

Exercise 2: Output dimensions
Write a function that computes output dimensions for any sequence of conv and pool layers:
def compute_output_size(input_size, layers):
    """
    layers: list of dicts with keys:
        'type': 'conv' or 'pool'
        'kernel': kernel size
        'stride': stride
        'padding': padding
    """
    pass

Exercise 3: CIFAR-10 classifier
Build a CNN for CIFAR-10 (32×32 color images, 10 classes):
  1. Design an architecture that achieves >85% accuracy
  2. Use batch normalization and dropout
  3. Visualize learned filters and feature maps
  4. Analyze which classes are confused with each other

Exercise 4: Depthwise separable convolutions
Implement depthwise separable convolutions (used in MobileNet):
  1. Depthwise: one filter per input channel
  2. Pointwise: 1×1 convolution to mix channels
Compare parameter count and speed to standard convolutions (a hint follows below).
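A hint for the last exercise (a sketch, not a full solution): PyTorch expresses depthwise convolution through the groups argument of nn.Conv2d, so a rough parameter comparison looks like this:
# Depthwise + pointwise vs. a standard 3x3 convolution (32 -> 64 channels)
depthwise = nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=32)  # one 3x3 filter per channel
pointwise = nn.Conv2d(32, 64, kernel_size=1)                        # 1x1 conv mixes channels
standard  = nn.Conv2d(32, 64, kernel_size=3, padding=1)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(depthwise) + count(pointwise), "vs", count(standard))   # 2432 vs 18496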

Key Takeaways

| Concept | Key Insight |
|---------|-------------|
| Convolution | Local patterns, shared weights |
| Pooling | Downsample, add invariance |
| Stride | Skip pixels, reduce dimensions |
| Padding | Control output size |
| Feature hierarchy | Edges → Shapes → Parts → Objects |
| Receptive field | The input region that affects a given output |

What’s Next