# Convolutional Neural Networks

> The architecture that revolutionized computer vision - convolutions, filters, and feature maps

<Frame>
  <img src="https://mintcdn.com/devweeekends/0kwJwOL2KCwg2YYu/images/courses/deep-learning-mastery/cnn-concept.svg?fit=max&auto=format&n=0kwJwOL2KCwg2YYu&q=85&s=6d9ac8fd3f686bc6b216ff4e4e96c851" alt="Convolutional Neural Networks" width="1080" height="1080" data-path="images/courses/deep-learning-mastery/cnn-concept.svg" />
</Frame>

## Why Images Need Special Treatment

A 224×224 RGB image has 224 × 224 × 3 = **150,528 input values** (50,176 pixels × 3 color channels).

If we connected this to a fully connected layer with 1000 neurons:

* 150,528 × 1000 = **150 million parameters** in ONE layer!
* Ignores spatial structure (neighboring pixels are related)
* Overfits easily
* Computationally expensive

**CNNs solve this** by:

1. **Local connectivity**: Each neuron only sees a small patch (a cat's ear doesn't depend on what's in the bottom-right corner)
2. **Weight sharing**: Same filter applied across entire image (an edge detector that works in the top-left should work anywhere)
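3. **Translation invariance**: A cat is a cat, regardless of position. Strictly speaking, convolution is translation *equivariant* (shift the input and the feature map shifts with it); pooling then makes the final prediction approximately invariant to position.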

Think of it like proofreading a document. A fully connected network reads the entire page at once, trying to understand every letter in relation to every other letter. A CNN uses a magnifying glass that slides across the page, looking for local patterns: misspellings, grammatical errors, formatting issues. The same magnifying glass works everywhere on the page -- you don't need a separate one for each paragraph. This is why CNNs are so parameter-efficient for spatial data.
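
The weight-sharing idea can be checked numerically. The sketch below is illustrative only: `correlate_valid` is a helper written for this example, and the 6×6 image and 1×2 filter are made-up values. It shows that moving a pattern one pixel to the right moves the filter's response by the same amount, as long as the pattern stays away from the border.

```python
import numpy as np

def correlate_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kH, kW = kernel.shape
    out_H = image.shape[0] - kH + 1
    out_W = image.shape[1] - kW + 1
    out = np.zeros((out_H, out_W))
    for i in range(out_H):
        for j in range(out_W):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

# A single bright pixel on a dark background
image = np.zeros((6, 6))
image[2, 2] = 1.0

# The same image with the pixel moved one column to the right
shifted = np.roll(image, shift=1, axis=1)

kernel = np.array([[1.0, -1.0]])  # responds to left/right intensity changes

out = correlate_valid(image, kernel)
out_shifted = correlate_valid(shifted, kernel)

# The response moves with the input: same values, one column over
shift_matches = np.array_equal(np.roll(out, shift=1, axis=1), out_shifted)
print("Output shifts with the input:", shift_matches)
```

One filter with two weights handles the pattern at both positions; a fully connected layer would need separate weights for each location.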

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/devweeekends/images/courses/deep-learning-mastery/fc-vs-conv.svg" alt="Fully Connected vs Convolutional" />
</Frame>

***

## The Convolution Operation

### Intuition: A Sliding Window Detector

Imagine sliding a small "template" (filter/kernel) across an image, computing similarity at each position:

```
Image:                    Filter:           Output:
[1 2 3 4 5]              [1 0 -1]          [? ? ?]
[2 3 4 5 6]                                
[3 4 5 6 7]        →          →            Feature Map
[4 5 6 7 8]
[5 6 7 8 9]
```

At each position, we compute: $\sum_{i,j} \text{Image}_{i,j} \cdot \text{Filter}_{i,j}$
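
To fill in the question marks, the sum can be evaluated directly with NumPy. This is a minimal sketch: because the filter in the diagram is 1×3, it slides along every row, so the full feature map is 5×3 (the diagram shows a single row of it).

```python
import numpy as np

image = np.array([
    [1, 2, 3, 4, 5],
    [2, 3, 4, 5, 6],
    [3, 4, 5, 6, 7],
    [4, 5, 6, 7, 8],
    [5, 6, 7, 8, 9],
])
filt = np.array([[1, 0, -1]])  # 1x3 horizontal-difference filter

kH, kW = filt.shape
out_H = image.shape[0] - kH + 1  # 5
out_W = image.shape[1] - kW + 1  # 3
feature_map = np.zeros((out_H, out_W), dtype=int)

for i in range(out_H):
    for j in range(out_W):
        patch = image[i:i + kH, j:j + kW]
        feature_map[i, j] = np.sum(patch * filt)

print(feature_map)
# Every entry is -2: the image increases by 1 per column,
# so left-minus-right across a 3-wide window is always -2.
```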

### Mathematical Definition

For a 2D convolution:

$$
(I * K)[i, j] = \sum_{m}\sum_{n} I[i+m, j+n] \cdot K[m, n]
$$

```python theme={null}
import numpy as np

def conv2d_naive(image, kernel, stride=1, padding=0):
    """
    Simple 2D convolution implementation.
    
    Args:
        image: Input image (H, W)
        kernel: Convolution kernel (kH, kW)
        stride: Step size
        padding: Zero padding around image
    """
    # Add padding
    if padding > 0:
        image = np.pad(image, padding, mode='constant')
    
    H, W = image.shape
    kH, kW = kernel.shape
    
    # Output dimensions
    out_H = (H - kH) // stride + 1
    out_W = (W - kW) // stride + 1
    
    output = np.zeros((out_H, out_W))
    
    for i in range(out_H):
        for j in range(out_W):
            # Extract patch
            patch = image[i*stride:i*stride+kH, j*stride:j*stride+kW]
            # Compute dot product
            output[i, j] = np.sum(patch * kernel)
    
    return output


# Example: Edge detection
image = np.array([
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
])

# Vertical edge detector
edge_kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1]
])

result = conv2d_naive(image, edge_kernel)
print("Input shape:", image.shape)
print("Output shape:", result.shape)
print("Output:\n", result)
```

***

## Common Filter Types

### Edge Detection

```python theme={null}
import numpy as np
import matplotlib.pyplot as plt
from scipy import ndimage

# Sobel filters for edges
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

sobel_y = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]])

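# Load sample image (data.camera() is already grayscale, so no rgb2gray is needed)
from skimage import data
from skimage.util import img_as_float

image = img_as_float(data.camera())  # uint8 [0, 255] -> float [0, 1], avoids integer overflow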

# Apply filters
edges_x = ndimage.convolve(image, sobel_x)
edges_y = ndimage.convolve(image, sobel_y)
edges_magnitude = np.sqrt(edges_x**2 + edges_y**2)

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
axes[0].imshow(image, cmap='gray')
axes[0].set_title('Original')
axes[1].imshow(np.abs(edges_x), cmap='gray')
axes[1].set_title('Vertical Edges')
axes[2].imshow(np.abs(edges_y), cmap='gray')
axes[2].set_title('Horizontal Edges')
axes[3].imshow(edges_magnitude, cmap='gray')
axes[3].set_title('Edge Magnitude')
plt.show()
```

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/devweeekends/images/courses/deep-learning-mastery/edge-detection-filters.svg" alt="Edge Detection Filters" />
</Frame>

### Blur and Sharpen

```python theme={null}
# Gaussian blur (smoothing)
gaussian = np.array([[1, 2, 1],
                     [2, 4, 2],
                     [1, 2, 1]]) / 16

# Sharpening
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])

# Emboss
emboss = np.array([[-2, -1, 0],
                   [-1,  1, 1],
                   [ 0,  1, 2]])
```
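
One quick way to reason about these kernels is to look at the sum of their weights. The sketch below redefines the kernels so that it runs on its own, and evaluates each one on a perfectly flat patch (an assumed test input). A kernel that sums to 1 leaves flat regions unchanged, while one that sums to 0, such as a Sobel filter, outputs zero wherever the image is constant.

```python
import numpy as np

kernels = {
    "gaussian": np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 16,
    "sharpen":  np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]]),
    "emboss":   np.array([[-2, -1, 0], [-1, 1, 1], [0, 1, 2]]),
    "sobel_x":  np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]),
}

# A perfectly flat 3x3 patch with intensity 0.5
flat_patch = np.full((3, 3), 0.5)

responses = {}
for name, k in kernels.items():
    # Response of the kernel centered on a flat region
    responses[name] = float(np.sum(flat_patch * k))
    print(f"{name:9s} sum={float(k.sum()):5.2f}  response on flat patch={responses[name]:.2f}")
```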

***

## CNN Building Blocks

### Convolutional Layer

```python theme={null}
import torch
import torch.nn as nn

# Convolutional layer -- the core building block of every CNN
conv = nn.Conv2d(
    in_channels=3,      # RGB input (3 color channels)
    out_channels=32,    # 32 different filters, each learning a different pattern
    kernel_size=3,      # 3x3 filters -- small enough for local patterns, large enough for edges
    stride=1,           # Move 1 pixel at a time (no skipping)
    padding=1           # Add 1 pixel of zeros around the border to preserve spatial dimensions
    # Why padding=1 with kernel=3? Output = (input - 3 + 2*1)/1 + 1 = input. Dimensions preserved.
)

# Input: (batch, channels, height, width)
x = torch.randn(1, 3, 224, 224)
output = conv(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Number of parameters: {sum(p.numel() for p in conv.parameters()):,}")
```

**Parameter count**: $(k_H \times k_W \times C_{in} + 1) \times C_{out}$

* $3 \times 3 \times 3 + 1 = 28$ parameters per filter
* $28 \times 32 = 896$ total parameters

Compare to a fully connected layer mapping the flattened image to just 32 neurons: $224 \times 224 \times 3 \times 32 \approx 4.8$ million weights!
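
The parameter-count formula is easy to wrap in a helper. `conv2d_params` and `linear_params` below are names made up for this sketch, not PyTorch functions; they only restate the arithmetic above.

```python
def conv2d_params(k_h, k_w, c_in, c_out, bias=True):
    """Weights (and optional biases) in a standard 2D convolution layer."""
    per_filter = k_h * k_w * c_in + (1 if bias else 0)
    return per_filter * c_out

def linear_params(n_in, n_out, bias=True):
    """Weights (and optional biases) in a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

conv = conv2d_params(3, 3, c_in=3, c_out=32)
fc = linear_params(224 * 224 * 3, 32, bias=False)

print(f"Conv 3x3, 3->32: {conv:,} parameters")   # 896
print(f"FC 150,528->32:  {fc:,} weights")        # 4,816,896
print(f"Ratio: about {fc // conv:,}x")           # 5,376x
```

Note that the convolution's count does not depend on the image size at all, while the fully connected count grows with every extra pixel.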

### Pooling Layers

Pooling reduces spatial dimensions while keeping important features. Think of it as summarizing: instead of reporting every detail, you report the highlights.

```python theme={null}
# Max pooling - takes maximum in each 2x2 window
# Intuition: "Is this feature present ANYWHERE in this region?"
# A strong edge activation in any of the 4 pixels survives; the rest are discarded.
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)

# Average pooling - takes average in each window
# Intuition: "How strongly is this feature present ON AVERAGE in this region?"
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

# Global average pooling - reduces entire feature map to a single value per channel
# Many modern CNNs use this instead of flattening into large fully connected layers
global_avg_pool = nn.AdaptiveAvgPool2d(1)

x = torch.randn(1, 32, 224, 224)

print(f"Input: {x.shape}")
print(f"After 2×2 max pool: {max_pool(x).shape}")
print(f"After global avg pool: {global_avg_pool(x).shape}")
```

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/devweeekends/images/courses/deep-learning-mastery/pooling-operations.svg" alt="Pooling Operations" />
</Frame>
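
To see the numbers behind each pooling type, here is a small NumPy sketch on a made-up 4×4 feature map. It uses a reshape trick that works only for non-overlapping windows (window size equal to stride), which is the 2×2, stride-2 case discussed here.

```python
import numpy as np

x = np.array([
    [1, 3, 2, 0],
    [5, 2, 1, 4],
    [0, 1, 7, 2],
    [2, 6, 3, 3],
])

# Split the 4x4 map into a 2x2 grid of non-overlapping 2x2 windows.
# Axes are (window_row, row_in_window, window_col, col_in_window).
windows = x.reshape(2, 2, 2, 2)

max_pooled = windows.max(axis=(1, 3))   # strongest activation per window
avg_pooled = windows.mean(axis=(1, 3))  # mean activation per window
global_avg = x.mean()                   # one number for the whole map

print("Max pool:\n", max_pooled)        # [[5 4] [6 7]]
print("Avg pool:\n", avg_pooled)        # [[2.75 1.75] [2.25 3.75]]
print("Global avg:", global_avg)        # 2.625
```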

***

## Building a Complete CNN

### LeNet-5 Style Architecture

```python theme={null}
class SimpleCNN(nn.Module):
    """Simple CNN for MNIST digit classification."""
    
    def __init__(self):
        super().__init__()
        
        # Convolutional layers -- each learns increasingly abstract features
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)   # 1 input channel (grayscale), 32 filters
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)  # 32->64: doubling channels is a common pattern
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, padding=1)  # 64->64: same channels, more depth
        
        # Pooling -- halves spatial dimensions each time
        self.pool = nn.MaxPool2d(2, 2)
        
        # Fully connected layers -- map spatial features to class predictions
        # 64 channels * 3*3 spatial = 576 features after three pooling operations
        self.fc1 = nn.Linear(64 * 3 * 3, 64)
        self.fc2 = nn.Linear(64, 10)  # 10 output classes (digits 0-9)
        
        # Dropout: randomly zero 50% of FC neurons during training
        # This forces the network to learn redundant representations -- no single neuron can be a bottleneck
        self.dropout = nn.Dropout(0.5)
    
    def forward(self, x):
        # Conv block 1: 28x28 -> 14x14 (learns edges and simple strokes)
        x = self.pool(torch.relu(self.conv1(x)))
        
        # Conv block 2: 14x14 -> 7x7 (learns combinations of edges: curves, corners)
        x = self.pool(torch.relu(self.conv2(x)))
        
        # Conv block 3: 7x7 -> 3x3 (learns digit-specific parts: loops, lines)
        x = self.pool(torch.relu(self.conv3(x)))
        
        # Flatten: collapse spatial dimensions into a single vector
        x = x.view(x.size(0), -1)  # (batch, 64, 3, 3) -> (batch, 576)
        
        # Fully connected classifier
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)  # Only active during training (model.train())
        x = self.fc2(x)      # Raw logits -- no softmax here
        
        return x


model = SimpleCNN()
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")
```

### Training the CNN

```python theme={null}
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Data loading
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_data = datasets.MNIST('./data', train=False, transform=transform)

train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=1000)

# Training setup
# Move model to GPU if available -- CNNs benefit enormously from GPU parallelism
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()  # Expects raw logits, applies softmax internally
optimizer = optim.Adam(model.parameters(), lr=0.001)  # 0.001 is the safe default for Adam

# Training loop -- identical pattern to MLPs, just with images instead of flat vectors
def train_epoch(model, loader, criterion, optimizer, device):
    model.train()  # Enable dropout and batch norm training mode
    total_loss = 0
    correct = 0
    
    for data, target in loader:
        data, target = data.to(device), target.to(device)  # Move data to same device as model
        
        optimizer.zero_grad()       # Clear accumulated gradients
        output = model(data)        # Forward: image -> class logits
        loss = criterion(output, target)  # Compute cross-entropy loss
        loss.backward()             # Backward: compute gradients
        optimizer.step()            # Update weights
        
        total_loss += loss.item()
        pred = output.argmax(dim=1)  # Predicted class = highest logit
        correct += pred.eq(target).sum().item()
    
    return total_loss / len(loader), 100. * correct / len(loader.dataset)

# Train for a few epochs
for epoch in range(5):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    print(f"Epoch {epoch+1}: Loss = {train_loss:.4f}, Accuracy = {train_acc:.2f}%")
```
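
The `test_loader` defined above is not used by the training loop. A minimal evaluation function, assuming the same `(data, target)` batch format, might look like the sketch below. `evaluate` is a name introduced here, and the tiny linear model and random tensors are stand-ins so the block runs on its own without downloading MNIST.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader, criterion, device):
    """Return (mean loss per batch, accuracy in percent) without updating weights."""
    model.eval()  # Disable dropout; batch norm uses its running statistics
    total_loss = 0.0
    correct = 0

    with torch.no_grad():  # No gradients needed for evaluation
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            total_loss += criterion(output, target).item()
            correct += output.argmax(dim=1).eq(target).sum().item()

    return total_loss / len(loader), 100.0 * correct / len(loader.dataset)

# Stand-in data and model (assumptions for this sketch only)
torch.manual_seed(0)
fake_images = torch.randn(20, 1, 28, 28)
fake_labels = torch.randint(0, 10, (20,))
fake_loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=10)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

loss, acc = evaluate(toy_model, fake_loader, nn.CrossEntropyLoss(), torch.device("cpu"))
print(f"Loss = {loss:.4f}, Accuracy = {acc:.2f}%")
```

With the real objects from this section, the call would be `evaluate(model, test_loader, criterion, device)`.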

***

## Visualizing What CNNs Learn

### Filter Visualization

```python theme={null}
def visualize_filters(model):
    """Visualize the learned filters in the first conv layer."""
    # Get first conv layer weights
    weights = model.conv1.weight.data.cpu()
    
    # Normalize for visualization
    weights = (weights - weights.min()) / (weights.max() - weights.min())
    
    # Plot
    fig, axes = plt.subplots(4, 8, figsize=(12, 6))
    for i, ax in enumerate(axes.flat):
        if i < weights.shape[0]:
            ax.imshow(weights[i, 0], cmap='gray')
        ax.axis('off')
    
    plt.suptitle('Learned Filters (First Conv Layer)')
    plt.show()

visualize_filters(model)
```

### Feature Map Visualization

```python theme={null}
def visualize_feature_maps(model, image):
    """Visualize feature maps at different layers."""
    activations = []
    
    def hook_fn(module, input, output):
        activations.append(output.detach())
    
    # Register hooks
    hooks = [
        model.conv1.register_forward_hook(hook_fn),
        model.conv2.register_forward_hook(hook_fn),
        model.conv3.register_forward_hook(hook_fn),
    ]
    
    # Forward pass (eval mode so dropout is disabled)
    model.eval()
    with torch.no_grad():
        model(image.unsqueeze(0))
    
    # Remove hooks
    for hook in hooks:
        hook.remove()
    
    # Plot feature maps
    fig, axes = plt.subplots(3, 8, figsize=(16, 6))
    
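    for layer_idx, acts in enumerate(activations):
        for i in range(8):
            ax = axes[layer_idx, i]
            ax.imshow(acts[0, i].cpu(), cmap='viridis')
            # Hide ticks but keep the axes on; ax.axis('off') would also hide the row labels
            ax.set_xticks([])
            ax.set_yticks([])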
    
    axes[0, 0].set_ylabel('Conv1', rotation=90, fontsize=12)
    axes[1, 0].set_ylabel('Conv2', rotation=90, fontsize=12)
    axes[2, 0].set_ylabel('Conv3', rotation=90, fontsize=12)
    
    plt.suptitle('Feature Maps at Different Layers')
    plt.show()

# Visualize for a sample image
sample_image = test_data[0][0]
visualize_feature_maps(model, sample_image.to(device))
```

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/devweeekends/images/courses/deep-learning-mastery/feature-maps.svg" alt="Feature Maps Visualization" />
</Frame>

***

## Key CNN Concepts

### Stride and Padding

| Concept     | Effect                             | Formula                                                             |
| ----------- | ---------------------------------- | ------------------------------------------------------------------- |
| **Stride**  | Step size between filter positions | Output = ⌊(Input - Kernel + 2×Padding) / Stride⌋ + 1                |
| **Padding** | Zeros added around input           | `padding=kernel//2` keeps spatial dimensions (odd kernel, stride=1) |
| **Valid**   | No padding                         | Output = Input - Kernel + 1 (when stride=1)                         |
| **Same**    | Pad to maintain size               | Output = Input (when stride=1)                                      |

```python theme={null}
# Different stride and padding effects
x = torch.randn(1, 1, 32, 32)

conv_valid = nn.Conv2d(1, 1, kernel_size=5, padding=0)  # 32→28
conv_same = nn.Conv2d(1, 1, kernel_size=5, padding=2)   # 32→32
conv_stride = nn.Conv2d(1, 1, kernel_size=5, padding=2, stride=2)  # 32→16

print(f"Valid: {x.shape} → {conv_valid(x).shape}")
print(f"Same:  {x.shape} → {conv_same(x).shape}")
print(f"Stride 2: {x.shape} → {conv_stride(x).shape}")
```
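
The output-size formula can be checked without PyTorch. `conv_output_size` is a helper name chosen for this sketch; it applies the formula along one spatial dimension, using floor division for cases where the stride does not divide evenly.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension of a conv or pooling layer."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(32, 5, padding=0))            # valid: 28
print(conv_output_size(32, 5, padding=2))            # same: 32
print(conv_output_size(32, 5, padding=2, stride=2))  # stride 2: 16
```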

### Receptive Field

The **receptive field** is the region of input that affects a particular output neuron. This is one of the most important concepts in CNN design.

Think of it as "how much of the original image can this neuron see?" A neuron in the first conv layer sees a 3x3 patch. A neuron two layers deep effectively sees a 5x5 patch (because it combines outputs from overlapping 3x3 patches). A neuron at the end of a deep CNN might "see" the entire image. If your receptive field is too small for the patterns you need to detect (say, you need to recognize a full face but your receptive field only covers an eye), the network will struggle -- no single neuron can integrate enough context.

```python theme={null}
def compute_receptive_field(layers):
    """
    Compute receptive field for a stack of conv layers.
    
    Args:
        layers: List of (kernel_size, stride) tuples
    """
    r = 1  # Start with 1×1 receptive field
    
    for kernel_size, stride in reversed(layers):
        # Walk back from the output: r outputs spaced `stride` apart, each kernel_size wide
        r = stride * (r - 1) + kernel_size
    
    return r

# Example: VGG-style stack
layers = [
    (3, 1),  # Conv 3×3, stride 1
    (3, 1),  # Conv 3×3, stride 1
    (2, 2),  # Pool 2×2, stride 2
    (3, 1),  # Conv 3×3, stride 1
    (3, 1),  # Conv 3×3, stride 1
    (2, 2),  # Pool 2×2, stride 2
]

print(f"Receptive field: {compute_receptive_field(layers)}×{compute_receptive_field(layers)}")
```
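
The receptive field can also be accumulated front-to-back, which makes it easy to see how it grows layer by layer. The sketch below assumes square kernels and tracks the cumulative stride (called `jump` here, a name chosen for this example): each layer adds `(kernel_size - 1) * jump` input pixels.

```python
def receptive_field_forward(layers):
    """
    Track the receptive field front-to-back.

    layers: list of (kernel_size, stride) tuples, first layer first.
    Returns a list with the receptive field after each layer.
    """
    r = 1     # receptive field of a single input pixel
    jump = 1  # distance in input pixels between neighboring outputs
    history = []
    for kernel_size, stride in layers:
        r = r + (kernel_size - 1) * jump  # each extra tap reaches `jump` pixels further
        jump = jump * stride              # later layers step over more input pixels
        history.append(r)
    return history

vgg_style = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2)]
for (k, s), r in zip(vgg_style, receptive_field_forward(vgg_style)):
    print(f"kernel={k}, stride={s} -> receptive field {r}x{r}")
```

For this stack the values are 3, 5, 6, 10, 14, 16: the two convolutions after the first pooling layer each add 4 pixels instead of 2, because the pooling doubled the jump.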

***

## Classic CNN Architectures

| Architecture     | Year | Key Innovation       | Depth    |
| ---------------- | ---- | -------------------- | -------- |
| **LeNet-5**      | 1998 | First successful CNN | 5        |
| **AlexNet**      | 2012 | ReLU, Dropout, GPU   | 8        |
| **VGGNet**       | 2014 | Small 3×3 filters    | 16-19    |
| **GoogLeNet**    | 2014 | Inception modules    | 22       |
| **ResNet**       | 2015 | Skip connections     | 50-152   |
| **DenseNet**     | 2017 | Dense connections    | 121-264  |
| **EfficientNet** | 2019 | Compound scaling     | variable |

### VGG-16 Implementation

```python theme={null}
class VGG16(nn.Module):
    """Simplified VGG-16 architecture."""
    
    def __init__(self, num_classes=1000):
        super().__init__()
        
        self.features = nn.Sequential(
            # Block 1
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            # Block 2
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            # Block 3
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            # Block 4
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            # Block 5
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
        )
        
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )
    
    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x
```
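
It can be instructive to see where this architecture's parameters sit. The sketch below counts them with plain arithmetic, assuming a 224×224 input (so the feature map is 7×7 before the classifier) and `num_classes=1000`, matching the class above.

```python
# (in_channels, out_channels) for the 13 conv layers, all 3x3 with bias
conv_cfg = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]
# (in_features, out_features) for the 3 fully connected layers
fc_cfg = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]

conv_params = sum((3 * 3 * c_in + 1) * c_out for c_in, c_out in conv_cfg)
fc_params = sum((n_in + 1) * n_out for n_in, n_out in fc_cfg)
total = conv_params + fc_params

print(f"Conv layers: {conv_params:,}")   # 14,714,688
print(f"FC layers:   {fc_params:,}")     # 123,642,856
print(f"Total:       {total:,}")         # 138,357,544
print(f"FC share:    {100 * fc_params / total:.1f}%")
```

Roughly 89% of the parameters are in the three fully connected layers, most of them in the first one.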

***

## Exercises

<AccordionGroup>
  <Accordion title="Exercise 1: Custom Filters">
    Implement and apply these classic filters:

    1. Gaussian blur (5×5)
    2. Laplacian edge detector
    3. Custom "cross" pattern detector

    Apply them to real images and visualize results.
  </Accordion>

  <Accordion title="Exercise 2: Output Size Calculator">
    Write a function that computes output dimensions for any sequence of conv and pool layers:

    ```python theme={null}
    def compute_output_size(input_size, layers):
        """
        layers: list of dicts with keys:
            'type': 'conv' or 'pool'
            'kernel': kernel size
            'stride': stride
            'padding': padding
        """
        pass
    ```
  </Accordion>

  <Accordion title="Exercise 3: CIFAR-10 CNN">
    Build a CNN for CIFAR-10 (32×32 color images, 10 classes):

    1. Design architecture to achieve >85% accuracy
    2. Use batch normalization and dropout
    3. Visualize learned filters and feature maps
    4. Analyze which classes are confused
  </Accordion>

  <Accordion title="Exercise 4: Depthwise Separable Convolutions">
    Implement depthwise separable convolutions (used in MobileNet):

    1. Depthwise: one filter per input channel
    2. Pointwise: 1×1 convolution to mix channels

    Compare parameters and speed to standard convolutions.
  </Accordion>
</AccordionGroup>

***

## Key Takeaways

| Concept               | Key Insight                      |
| --------------------- | -------------------------------- |
| **Convolution**       | Local patterns, shared weights   |
| **Pooling**           | Downsample, add invariance       |
| **Stride**            | Skip pixels, reduce dimensions   |
| **Padding**           | Control output size              |
| **Feature hierarchy** | Edges → Shapes → Parts → Objects |
| **Receptive field**   | What input region affects output |

***

## What's Next

<CardGroup cols={1}>
  <Card title="Module 7: Pooling, Stride & CNN Design" icon="compress-alt" href="/courses/deep-learning-mastery/07-cnn-design">
    Build modern CNN architectures — VGG, ResNet, EfficientNet design principles.
  </Card>
</CardGroup>

***

## Interview Deep-Dive

<AccordionGroup>
  <Accordion title="Explain why CNNs use weight sharing and local connectivity. What inductive biases do these encode, and when do they fail?">
    **Strong Answer:**

    * **Weight sharing** means the same convolutional filter is applied at every spatial position. This encodes the inductive bias of translation equivariance: a feature detector that recognizes a cat ear in the top-left should recognize it in the bottom-right. Without weight sharing, the network would need to independently learn the same pattern for every possible location, requiring far more parameters and data.
    * **Local connectivity** means each neuron only connects to a small spatial patch (the receptive field), not the entire image. This encodes the locality bias: nearby pixels are more related than distant ones. An edge at position $(10, 10)$ depends on pixels at $(9,9)$ through $(11,11)$, not on a pixel at $(200, 200)$.
    * Together, these biases reduce the parameter count by orders of magnitude: a 3x3 convolution with 32 filters on a 224x224 RGB image uses 896 parameters, versus roughly 4.8 million for a fully connected layer with 32 outputs (and about 150 million with 1,000 outputs). This massive reduction acts as strong regularization and helps prevent overfitting.
    * **When they fail**: (1) Tasks requiring global context, like counting objects across the entire image -- local receptive fields miss long-range dependencies. (2) Data where the same pattern at different locations has different meanings -- medical images where a lesion on the left lung has different significance than on the right. (3) Non-grid-structured data -- graphs, point clouds, or sets of variable-length items. This is precisely why Vision Transformers, which have global attention without locality bias, can outperform CNNs when sufficient data is available to learn spatial relationships from scratch.

    **Follow-up: ViTs learn to attend to any position -- does this mean locality bias is unnecessary?**

    Not unnecessary, just learnable at a cost. ViTs need much more data (14M+ images for ViT-B) to learn the spatial relationships that CNNs encode for free. With sufficient data, ViTs discover locality patterns similar to CNNs in early layers (attention maps show local attention) and global patterns in later layers. Hybrid architectures like Swin Transformer reintroduce locality through windowed attention, achieving CNN-like data efficiency while maintaining the global reasoning capability for later layers. The trend is toward architectures that use locality bias where it helps (early layers) and global attention where it helps (later layers).
  </Accordion>

  <Accordion title="What is the receptive field of a CNN, and why is it critical for architecture design?">
    **Strong Answer:**

    * The receptive field is the region of the input image that can influence a particular neuron's output. A neuron in the first 3x3 conv layer has a 3x3 receptive field. After stacking two 3x3 conv layers, the effective receptive field is 5x5. After three, it is 7x7.
    * **Why it is critical**: the receptive field determines what scale of features the network can detect. To recognize a face (roughly 100x100 pixels in a typical image), the final convolutional features must have a receptive field of at least 100x100. If the receptive field is only 50x50, no single neuron can "see" the entire face, and the network must rely on the fully connected layers to integrate partial information.
    * **Computing receptive field**: for a stack of $n$ layers with kernel size $k$ and stride $s$, the receptive field grows as $r_{l} = r_{l-1} + (k_l - 1) \cdot \prod_{i=1}^{l-1} s_i$. Pooling layers with stride 2 are powerful receptive field amplifiers because every subsequent layer's kernel covers twice as much input space.
    * **VGG's insight**: two 3x3 convolutions have the same receptive field as one 5x5, but with fewer parameters ($2 \times 3^2 = 18$ vs. $5^2 = 25$ weights per input-output channel pair) and more non-linearity (two ReLU activations instead of one). Three 3x3 convolutions match a 7x7 filter ($3 \times 9 = 27$ vs. $49$) with even more savings. This is why many CNNs since VGG stack 3x3 kernels deeply instead of using a few large ones; larger kernels still appear in places such as a 7x7 first layer.
    * The **effective receptive field** (the region that actually influences the neuron strongly) is typically much smaller than the theoretical receptive field, because pixels near the center reach the neuron through many more paths than pixels near the edge. The resulting influence is roughly Gaussian, not uniform.

    **Follow-up: How do dilated (atrous) convolutions expand the receptive field without increasing parameters?**

    Dilated convolutions insert gaps between kernel elements. A 3x3 kernel with dilation 2 covers a 5x5 area (with gaps), providing a 5x5 receptive field with only 9 parameters. Stacking dilated convolutions with exponentially increasing dilation rates (1, 2, 4, 8, 16) creates very large receptive fields without pooling -- preserving spatial resolution. This is essential for tasks like semantic segmentation where you need both global context (large receptive field) and pixel-level precision (no downsampling). The trade-off is that dilated convolutions can create "gridding artifacts" where the gaps in the kernel cause aliasing. Mixing dilation rates helps: DeepLab's ASPP (Atrous Spatial Pyramid Pooling) applies multiple dilation rates in parallel and aggregates the results, which captures context at several scales.
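
The arithmetic behind this answer can be sketched in a few lines. Both helpers are names made up for this example; they assume stride-1 layers and look at one spatial dimension.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span in pixels covered by a dilated kernel along one dimension."""
    return dilation * (kernel_size - 1) + 1

def stacked_receptive_field(kernel_size, dilations):
    """Receptive field of stride-1 dilated convs applied one after another."""
    r = 1
    for d in dilations:
        r += effective_kernel_size(kernel_size, d) - 1
    return r

print(effective_kernel_size(3, 2))                   # 5
print(stacked_receptive_field(3, [1, 2, 4, 8, 16]))  # 63
print(stacked_receptive_field(3, [1, 1, 1, 1, 1]))   # 11
```

Both five-layer stacks use the same number of weights, but the dilated one sees a 63×63 region instead of 11×11.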
  </Accordion>

  <Accordion title="Calculate the output dimensions and parameter count for a Conv2d layer given specific input dimensions. Walk through the math.">
    **Strong Answer:**

    * **Output spatial dimensions**: $H_{out} = \lfloor (H_{in} + 2P - K) / S \rfloor + 1$, where $H_{in}$ is input height, $P$ is padding, $K$ is kernel size, $S$ is stride. Same formula for width.
    * **Concrete example**: input $(B, 3, 224, 224)$, Conv2d(3, 64, kernel\_size=7, stride=2, padding=3):
      * $H_{out} = \lfloor (224 + 2 \times 3 - 7) / 2 \rfloor + 1 = \lfloor 223/2 \rfloor + 1 = 112$
      * Output shape: $(B, 64, 112, 112)$
    * **Parameter count**: $(K_H \times K_W \times C_{in} + 1) \times C_{out}$ (the +1 is for the bias per filter).
      * $(7 \times 7 \times 3 + 1) \times 64 = 148 \times 64 = 9,\!472$ parameters.
      * Without bias: $7 \times 7 \times 3 \times 64 = 9,\!408$. Many modern architectures use `bias=False` when followed by batch normalization, since BN's learned shift parameter ($\beta$) makes the bias redundant.
    * **Compute cost (FLOPs)**: roughly $2 \times K^2 \times C_{in} \times C_{out} \times H_{out} \times W_{out}$, counting each multiply-accumulate as two operations. For our example: $2 \times 49 \times 3 \times 64 \times 112 \times 112 \approx 236M$ FLOPs. That is a lot of compute for a layer with fewer than 10K parameters; the cost comes from the large spatial dimensions of the output.
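
The walkthrough above can be reproduced with a small helper. `conv2d_cost` is a hypothetical function written for this sketch; it assumes a square kernel and ignores the bias additions when counting FLOPs.

```python
def conv2d_cost(h_in, w_in, c_in, c_out, k, stride=1, padding=0, bias=True):
    """Output shape, parameter count, and approximate FLOPs for a square-kernel Conv2d."""
    h_out = (h_in + 2 * padding - k) // stride + 1
    w_out = (w_in + 2 * padding - k) // stride + 1
    params = (k * k * c_in + (1 if bias else 0)) * c_out
    macs = k * k * c_in * c_out * h_out * w_out  # multiply-accumulates, bias ignored
    return {"out_shape": (c_out, h_out, w_out), "params": params, "flops": 2 * macs}

stem = conv2d_cost(224, 224, c_in=3, c_out=64, k=7, stride=2, padding=3)
print(stem["out_shape"])        # (64, 112, 112)
print(f"{stem['params']:,}")    # 9,472
print(f"{stem['flops']:,}")     # 236,027,904
```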

    **Follow-up: Why is it common to double the number of channels when halving spatial dimensions (e.g., 64 channels at 56x56, 128 at 28x28)?**

    This design principle (from VGG and adopted by ResNet) keeps the total "information capacity" roughly constant across layers. When spatial dimensions are halved by stride-2 convolution or pooling, the number of spatial positions drops by 4x ($56 \times 56 = 3136$ vs. $28 \times 28 = 784$). Doubling the channel count partially compensates, keeping the total number of activations ($C \times H \times W$) from dropping too drastically. If channels were not increased, later layers would represent the input in progressively lower-dimensional spaces, creating information bottlenecks. Conversely, increasing channels beyond 2x would make later layers disproportionately expensive. The 2x rule is a practical sweet spot between information preservation and computational efficiency.
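
A quick count of activations makes the trade-off concrete. The numbers below use the 64-channel, 56×56 example from the question; `activations` is a helper name introduced for this sketch.

```python
def activations(channels, height, width):
    """Number of values in one feature-map tensor (per image)."""
    return channels * height * width

before = activations(64, 56, 56)        # 200,704
doubled = activations(128, 28, 28)      # 100,352: half of `before`
not_doubled = activations(64, 28, 28)   # 50,176: a quarter of `before`

print(before, doubled, not_doubled)
print(before / doubled, before / not_doubled)  # 2.0 4.0
```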
  </Accordion>
</AccordionGroup>
