Convolutional Neural Networks

Why Images Need Special Treatment

A 224×224 RGB image has 224 × 224 × 3 = 150,528 input values (50,176 pixels × 3 color channels). If we connected this to a fully connected layer with 1000 neurons:
  • 150,528 × 1000 = 150 million parameters in ONE layer!
  • Ignores spatial structure (neighboring pixels are related)
  • Overfits easily
  • Computationally expensive
CNNs solve this by:
  1. Local connectivity: Each neuron only sees a small patch (a cat’s ear doesn’t depend on what’s in the bottom-right corner)
  2. Weight sharing: Same filter applied across entire image (an edge detector that works in the top-left should work anywhere)
  3. Translation invariance: Cat is a cat, regardless of position (shared filters respond to a pattern wherever it appears, and pooling makes the final prediction largely insensitive to small shifts)
Think of it like proofreading a document. A fully connected network reads the entire page at once, trying to understand every letter in relation to every other letter. A CNN uses a magnifying glass that slides across the page, looking for local patterns: misspellings, grammatical errors, formatting issues. The same magnifying glass works everywhere on the page — you don’t need a separate one for each paragraph. This is why CNNs are so parameter-efficient for spatial data.
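To make weight sharing concrete, here is a minimal NumPy sketch on a 1D signal. The signal, kernel, and shift are made-up values for illustration. It checks that moving the pattern in the input simply moves the filter response by the same amount.

```python
import numpy as np

# A 1D "image" containing one small pattern, surrounded by zeros (illustrative values).
signal = np.zeros(12)
signal[3:6] = [1.0, 2.0, 1.0]

# One shared filter, slid across every position.
kernel = np.array([1.0, 0.0, -1.0])

shift = 4
shifted_signal = np.roll(signal, shift)  # same pattern, moved 4 steps to the right

response = np.correlate(signal, kernel, mode="valid")
shifted_response = np.correlate(shifted_signal, kernel, mode="valid")

# Shifting the input shifts the response by the same amount.
is_equivariant = np.array_equal(shifted_response, np.roll(response, shift))
print(response)
print(shifted_response)
print("Shifted input -> shifted output:", is_equivariant)
```

The filter never had to "relearn" the pattern at the new position, which is the practical payoff of sharing weights.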
Figure: Fully Connected vs Convolutional

The Convolution Operation

Intuition: A Sliding Window Detector

Imagine sliding a small “template” (filter/kernel) across an image, computing similarity at each position:
Image:                    Filter:           Output:
[1 2 3 4 5]              [1 0 -1]          [? ? ?]
[2 3 4 5 6]                                
[3 4 5 6 7]        →          →            Feature Map
[4 5 6 7 8]
[5 6 7 8 9]
At each position, we compute: $\sum_{i,j} \text{Image}_{i,j} \cdot \text{Filter}_{i,j}$
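As a quick worked check of the diagram above, this sketch slides the 1×3 filter along the first row of the image and fills in the three `?` entries. Restricting it to the first row is a simplification chosen here for brevity.

```python
import numpy as np

image = np.array([
    [1, 2, 3, 4, 5],
    [2, 3, 4, 5, 6],
    [3, 4, 5, 6, 7],
    [4, 5, 6, 7, 8],
    [5, 6, 7, 8, 9],
])
filt = np.array([1, 0, -1])  # the 1x3 filter from the diagram

# A width-5 row and a width-3 filter give 3 valid positions.
first_row = image[0]
outputs = [int(np.sum(first_row[j:j + 3] * filt)) for j in range(3)]
print(outputs)  # [-2, -2, -2]
```

Every position gives the same value because the row increases by a constant amount per pixel, so the filter measures a constant horizontal gradient.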

Mathematical Definition

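For a 2D convolution: $(I * K)[i, j] = \sum_{m}\sum_{n} I[i+m, j+n] \cdot K[m, n]$. Strictly speaking this is cross-correlation (the kernel is not flipped), which is what deep learning libraries implement under the name convolution.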
import numpy as np

def conv2d_naive(image, kernel, stride=1, padding=0):
    """
    Simple 2D convolution implementation.
    
    Args:
        image: Input image (H, W)
        kernel: Convolution kernel (kH, kW)
        stride: Step size
        padding: Zero padding around image
    """
    # Add padding
    if padding > 0:
        image = np.pad(image, padding, mode='constant')
    
    H, W = image.shape
    kH, kW = kernel.shape
    
    # Output dimensions
    out_H = (H - kH) // stride + 1
    out_W = (W - kW) // stride + 1
    
    output = np.zeros((out_H, out_W))
    
    for i in range(out_H):
        for j in range(out_W):
            # Extract patch
            patch = image[i*stride:i*stride+kH, j*stride:j*stride+kW]
            # Compute dot product
            output[i, j] = np.sum(patch * kernel)
    
    return output


# Example: Edge detection
image = np.array([
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
])

# Vertical edge detector
edge_kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1]
])

result = conv2d_naive(image, edge_kernel)
print("Input shape:", image.shape)
print("Output shape:", result.shape)
print("Output:\n", result)
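The double loop above is easy to read but slow. One possible vectorized variant, assuming NumPy 1.20+ for `sliding_window_view` and covering only the stride-1, no-padding case, is sketched below. `conv2d_vectorized` and the `demo_*` names are introduced here for illustration only.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_vectorized(image, kernel):
    """Stride-1, no-padding version of the sliding-window sum."""
    # windows has shape (out_H, out_W, kH, kW): one patch per output position
    windows = sliding_window_view(image, kernel.shape)
    # Multiply every patch by the kernel and sum over the two patch axes
    return np.einsum("ijkl,kl->ij", windows, kernel)

demo_image = np.zeros((7, 7))
demo_image[2:5, 2:5] = 1.0  # 3x3 bright square on a dark background

vertical_edge = np.array([[-1, 0, 1]] * 3)
demo_out = conv2d_vectorized(demo_image, vertical_edge)
print(demo_out.shape)  # (5, 5)
print(demo_out[2])     # [ 3.  3.  0. -3. -3.]
```

The middle row is positive where the window crosses the left edge of the square (dark to bright), zero in the center, and negative at the right edge.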

Common Filter Types

Edge Detection

import matplotlib.pyplot as plt
from scipy import ndimage

# Sobel filters for edges
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

sobel_y = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]])

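# Load sample image (data.camera() is already grayscale, so rgb2gray is not needed)
from skimage import data
from skimage.util import img_as_float

image = img_as_float(data.camera())  # uint8 -> float in [0, 1], avoids integer overflow in the filters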

# Apply filters
edges_x = ndimage.convolve(image, sobel_x)
edges_y = ndimage.convolve(image, sobel_y)
edges_magnitude = np.sqrt(edges_x**2 + edges_y**2)

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
axes[0].imshow(image, cmap='gray')
axes[0].set_title('Original')
axes[1].imshow(np.abs(edges_x), cmap='gray')
axes[1].set_title('Vertical Edges')
axes[2].imshow(np.abs(edges_y), cmap='gray')
axes[2].set_title('Horizontal Edges')
axes[3].imshow(edges_magnitude, cmap='gray')
axes[3].set_title('Edge Magnitude')
plt.show()
Figure: Edge Detection Filters

Blur and Sharpen

# Gaussian blur (smoothing)
gaussian = np.array([[1, 2, 1],
                     [2, 4, 2],
                     [1, 2, 1]]) / 16

# Sharpening
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])

# Emboss
emboss = np.array([[-2, -1, 0],
                   [-1,  1, 1],
                   [ 0,  1, 2]])
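A useful way to read these kernels is to look at the sum of their entries. The sketch below uses a uniform gray patch (a made-up test input) to show that kernels summing to 1 leave flat regions unchanged, while an edge kernel summing to 0 ignores them.

```python
import numpy as np

kernels = {
    "gaussian": np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 16,
    "sharpen": np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]]),
    "sobel_x": np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]),
}

flat_patch = np.full((3, 3), 0.5)  # uniform gray region with no structure

responses = {}
for name, k in kernels.items():
    responses[name] = float(np.sum(flat_patch * k))
    print(f"{name:8s} sum={k.sum():+.2f}  response on flat patch={responses[name]:+.2f}")
```

Blur and sharpen return the original brightness (0.5) on a flat region, so they only change the image where there is local contrast. The Sobel filter returns 0 there and responds only to intensity changes.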

CNN Building Blocks

Convolutional Layer

import torch
import torch.nn as nn

# Convolutional layer -- the core building block of every CNN
conv = nn.Conv2d(
    in_channels=3,      # RGB input (3 color channels)
    out_channels=32,    # 32 different filters, each learning a different pattern
    kernel_size=3,      # 3x3 filters -- small enough for local patterns, large enough for edges
    stride=1,           # Move 1 pixel at a time (no skipping)
    padding=1           # Add 1 pixel of zeros around the border to preserve spatial dimensions
    # Why padding=1 with kernel=3? Output = (input - 3 + 2*1)/1 + 1 = input. Dimensions preserved.
)

# Input: (batch, channels, height, width)
x = torch.randn(1, 3, 224, 224)
output = conv(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Number of parameters: {sum(p.numel() for p in conv.parameters()):,}")
Parameter count: $(k_H \times k_W \times C_{in} + 1) \times C_{out}$
  • $3 \times 3 \times 3 + 1 = 28$ parameters per filter
  • $28 \times 32 = 896$ total parameters
Compare to a fully connected layer with 32 outputs: $224 \times 224 \times 3 \times 32 \approx 4.8$ million weights!
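The arithmetic above can be wrapped in two small helpers. `conv2d_params` and `linear_params` are hypothetical names used only in this sketch, not library functions.

```python
def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """Weights (k * k * C_in per filter) plus one optional bias per filter."""
    per_filter = kernel_size * kernel_size * in_channels + (1 if bias else 0)
    return per_filter * out_channels

def linear_params(in_features, out_features, bias=True):
    """One weight per input per output, plus one optional bias per output."""
    return (in_features + (1 if bias else 0)) * out_features

conv_count = conv2d_params(3, 32, 3)
fc_count = linear_params(224 * 224 * 3, 32, bias=False)
print(f"{conv_count:,}")  # 896
print(f"{fc_count:,}")    # 4,816,896
```

Note that the convolution's parameter count does not depend on the image size at all, while the fully connected count grows with every extra pixel.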

Pooling Layers

Pooling reduces spatial dimensions while keeping important features. Think of it as summarizing: instead of reporting every detail, you report the highlights.
# Max pooling - takes maximum in each 2x2 window
# Intuition: "Is this feature present ANYWHERE in this region?"
# A strong edge activation in any of the 4 pixels survives; the rest are discarded.
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)

# Average pooling - takes average in each window
# Intuition: "How strongly is this feature present ON AVERAGE in this region?"
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

# Global average pooling - reduces entire feature map to a single value per channel
# Many modern CNNs (e.g. ResNet) use this in place of flattening into large fully connected layers,
# keeping only a final linear classifier
global_avg_pool = nn.AdaptiveAvgPool2d(1)

x = torch.randn(1, 32, 224, 224)

print(f"Input: {x.shape}")
print(f"After 2×2 max pool: {max_pool(x).shape}")
print(f"After global avg pool: {global_avg_pool(x).shape}")
Figure: Pooling Operations
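To see exactly which numbers pooling keeps, here is a small NumPy sketch on a 4×4 input. The reshape trick only works when the input size is a multiple of the window size, which is an assumption of this example.

```python
import numpy as np

x = np.arange(16, dtype=float).reshape(4, 4)

# Split into non-overlapping 2x2 blocks.
# Axes after reshape: (block_row, row_in_block, block_col, col_in_block)
blocks = x.reshape(2, 2, 2, 2)

max_pooled = blocks.max(axis=(1, 3))   # strongest value in each block
avg_pooled = blocks.mean(axis=(1, 3))  # mean value in each block
global_avg = x.mean()                  # one number for the whole map

print(max_pooled)   # [[ 5.  7.] [13. 15.]]
print(avg_pooled)   # [[ 2.5  4.5] [10.5 12.5]]
print(global_avg)   # 7.5
```

Each 2×2 block collapses to a single value, so the 4×4 map becomes 2×2 and global average pooling reduces it to one number.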

Building a Complete CNN

LeNet-5 Style Architecture

class SimpleCNN(nn.Module):
    """Simple CNN for MNIST digit classification."""
    
    def __init__(self):
        super().__init__()
        
        # Convolutional layers -- each learns increasingly abstract features
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)   # 1 input channel (grayscale), 32 filters
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)  # 32->64: doubling channels is a common pattern
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, padding=1)  # 64->64: same channels, more depth
        
        # Pooling -- halves spatial dimensions each time
        self.pool = nn.MaxPool2d(2, 2)
        
        # Fully connected layers -- map spatial features to class predictions
        # 64 channels * 3*3 spatial = 576 features after three pooling operations
        self.fc1 = nn.Linear(64 * 3 * 3, 64)
        self.fc2 = nn.Linear(64, 10)  # 10 output classes (digits 0-9)
        
        # Dropout: randomly zero 50% of FC neurons during training
        # This forces the network to learn redundant representations -- no single neuron can be a bottleneck
        self.dropout = nn.Dropout(0.5)
    
    def forward(self, x):
        # Conv block 1: 28x28 -> 14x14 (learns edges and simple strokes)
        x = self.pool(torch.relu(self.conv1(x)))
        
        # Conv block 2: 14x14 -> 7x7 (learns combinations of edges: curves, corners)
        x = self.pool(torch.relu(self.conv2(x)))
        
        # Conv block 3: 7x7 -> 3x3 (learns digit-specific parts: loops, lines)
        x = self.pool(torch.relu(self.conv3(x)))
        
        # Flatten: collapse spatial dimensions into a single vector
        x = x.view(x.size(0), -1)  # (batch, 64, 3, 3) -> (batch, 576)
        
        # Fully connected classifier
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)  # Only active during training (model.train())
        x = self.fc2(x)      # Raw logits -- no softmax here
        
        return x


model = SimpleCNN()
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")

Training the CNN

import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Data loading
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_data = datasets.MNIST('./data', train=False, transform=transform)

train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=1000)

# Training setup
# Move model to GPU if available -- CNNs benefit enormously from GPU parallelism
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()  # Expects raw logits, applies softmax internally
optimizer = optim.Adam(model.parameters(), lr=0.001)  # 0.001 is Adam's default learning rate in PyTorch

# Training loop -- identical pattern to MLPs, just with images instead of flat vectors
def train_epoch(model, loader, criterion, optimizer, device):
    model.train()  # Enable dropout and batch norm training mode
    total_loss = 0
    correct = 0
    
    for data, target in loader:
        data, target = data.to(device), target.to(device)  # Move data to same device as model
        
        optimizer.zero_grad()       # Clear accumulated gradients
        output = model(data)        # Forward: image -> class logits
        loss = criterion(output, target)  # Compute cross-entropy loss
        loss.backward()             # Backward: compute gradients
        optimizer.step()            # Update weights
        
        total_loss += loss.item()
        pred = output.argmax(dim=1)  # Predicted class = highest logit
        correct += pred.eq(target).sum().item()
    
    return total_loss / len(loader), 100. * correct / len(loader.dataset)

# Train for a few epochs
for epoch in range(5):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    print(f"Epoch {epoch+1}: Loss = {train_loss:.4f}, Accuracy = {train_acc:.2f}%")
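The loop above reports training accuracy only. A matching evaluation pass could look like the sketch below. `evaluate` is a helper written for this example, and the toy model and dataset are stand-ins so the block runs without downloading MNIST.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader, criterion, device):
    """Return (mean loss per sample, accuracy in %) without updating weights."""
    model.eval()  # disable dropout, use running stats in batch norm
    total_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():  # no gradients needed for evaluation
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            total_loss += criterion(output, target).item() * data.size(0)
            correct += (output.argmax(dim=1) == target).sum().item()
            total += data.size(0)
    return total_loss / total, 100.0 * correct / total

# Tiny stand-in model and dataset (hypothetical, only to show the call pattern)
toy_model = nn.Linear(2, 2, bias=False)
with torch.no_grad():
    toy_model.weight.copy_(torch.eye(2))  # logits equal the inputs
toy_x = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
toy_y = torch.tensor([0, 1, 0, 0])  # last label deliberately disagrees
toy_loader = DataLoader(TensorDataset(toy_x, toy_y), batch_size=2)

toy_loss, toy_acc = evaluate(toy_model, toy_loader, nn.CrossEntropyLoss(), torch.device("cpu"))
print(f"loss={toy_loss:.4f}, accuracy={toy_acc:.1f}%")
```

With the objects defined earlier in this section, the call would be `evaluate(model, test_loader, criterion, device)`. Weighting the loss by batch size keeps the average correct even when the last batch is smaller.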

Visualizing What CNNs Learn

Filter Visualization

def visualize_filters(model):
    """Visualize the learned filters in the first conv layer."""
    # Get first conv layer weights
    weights = model.conv1.weight.data.cpu()
    
    # Normalize for visualization
    weights = (weights - weights.min()) / (weights.max() - weights.min())
    
    # Plot
    fig, axes = plt.subplots(4, 8, figsize=(12, 6))
    for i, ax in enumerate(axes.flat):
        if i < weights.shape[0]:
            ax.imshow(weights[i, 0], cmap='gray')
        ax.axis('off')
    
    plt.suptitle('Learned Filters (First Conv Layer)')
    plt.show()

visualize_filters(model)

Feature Map Visualization

def visualize_feature_maps(model, image):
    """Visualize feature maps at different layers."""
    activations = []
    
    def hook_fn(module, input, output):
        activations.append(output.detach())
    
    # Register hooks
    hooks = [
        model.conv1.register_forward_hook(hook_fn),
        model.conv2.register_forward_hook(hook_fn),
        model.conv3.register_forward_hook(hook_fn),
    ]
    
    # Forward pass
    with torch.no_grad():
        model(image.unsqueeze(0))
    
    # Remove hooks
    for hook in hooks:
        hook.remove()
    
    # Plot feature maps
    fig, axes = plt.subplots(3, 8, figsize=(16, 6))
    
    for layer_idx, acts in enumerate(activations):
        for i in range(8):
            ax = axes[layer_idx, i]
            ax.imshow(acts[0, i].cpu(), cmap='viridis')
            # Hide ticks only; ax.axis('off') would also hide the row labels set below
            ax.set_xticks([])
            ax.set_yticks([])
    
    axes[0, 0].set_ylabel('Conv1', rotation=90, fontsize=12)
    axes[1, 0].set_ylabel('Conv2', rotation=90, fontsize=12)
    axes[2, 0].set_ylabel('Conv3', rotation=90, fontsize=12)
    
    plt.suptitle('Feature Maps at Different Layers')
    plt.show()

# Visualize for a sample image
sample_image = test_data[0][0]
visualize_feature_maps(model, sample_image.to(device))
Figure: Feature Maps Visualization

Key CNN Concepts

Stride and Padding

| Concept | Effect | Formula / Rule |
| --- | --- | --- |
| Stride | How many pixels the filter moves per step | Output = ⌊(Input − Kernel + 2×Padding) / Stride⌋ + 1 |
| Padding | Zeros added around input | Keeps spatial dimensions with padding = kernel // 2 (odd kernel, stride 1) |
| Valid | No padding | Output shrinks |
| Same | Pad to maintain size | Output = Input (when stride=1) |
# Different stride and padding effects
x = torch.randn(1, 1, 32, 32)

conv_valid = nn.Conv2d(1, 1, kernel_size=5, padding=0)  # 32→28
conv_same = nn.Conv2d(1, 1, kernel_size=5, padding=2)   # 32→32
conv_stride = nn.Conv2d(1, 1, kernel_size=5, padding=2, stride=2)  # 32→16

print(f"Valid: {x.shape} → {conv_valid(x).shape}")
print(f"Same:  {x.shape} → {conv_same(x).shape}")
print(f"Stride 2: {x.shape} → {conv_stride(x).shape}")
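The three cases above can be checked by hand with the output-size formula for a single layer. `conv_output_size` is a helper name chosen for this sketch.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """floor((input + 2 * padding - kernel) / stride) + 1"""
    return (input_size + 2 * padding - kernel_size) // stride + 1

valid_size = conv_output_size(32, 5)                          # no padding
same_size = conv_output_size(32, 5, padding=2)                # padding = kernel // 2
strided_size = conv_output_size(32, 5, stride=2, padding=2)   # stride 2 halves the size
pooled_7 = conv_output_size(7, 2, stride=2)                   # 2x2 pool on a 7x7 map

print(valid_size, same_size, strided_size, pooled_7)  # 28 32 16 3
```

The last case shows the floor at work: a 2×2 pool with stride 2 on a 7×7 map drops the leftover row and column, which is why the MNIST model earlier goes from 7×7 to 3×3.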

Receptive Field

The receptive field is the region of input that affects a particular output neuron. This is one of the most important concepts in CNN design. Think of it as “how much of the original image can this neuron see?” A neuron in the first conv layer sees a 3x3 patch. A neuron two layers deep effectively sees a 5x5 patch (because it combines outputs from overlapping 3x3 patches). A neuron at the end of a deep CNN might “see” the entire image. If your receptive field is too small for the patterns you need to detect (say, you need to recognize a full face but your receptive field only covers an eye), the network will struggle — no single neuron can integrate enough context.
def compute_receptive_field(layers):
    """
    Compute receptive field for a stack of conv layers.
    
    Args:
        layers: List of (kernel_size, stride) tuples
    """
    r = 1  # Start with 1×1 receptive field
    
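    for kernel_size, stride in reversed(layers):
        # Walking back toward the input: r outputs of this layer span
        # stride * (r - 1) + kernel_size of its inputs
        r = stride * (r - 1) + kernel_size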
    
    return r

# Example: VGG-style stack
layers = [
    (3, 1),  # Conv 3×3, stride 1
    (3, 1),  # Conv 3×3, stride 1
    (2, 2),  # Pool 2×2, stride 2
    (3, 1),  # Conv 3×3, stride 1
    (3, 1),  # Conv 3×3, stride 1
    (2, 2),  # Pool 2×2, stride 2
]

print(f"Receptive field: {compute_receptive_field(layers)}×{compute_receptive_field(layers)}")

Classic CNN Architectures

| Architecture | Year | Key Innovation | Depth |
| --- | --- | --- | --- |
| LeNet-5 | 1998 | First successful CNN | 5 |
| AlexNet | 2012 | ReLU, Dropout, GPU | 8 |
| VGGNet | 2014 | Small 3×3 filters | 16-19 |
| GoogLeNet | 2014 | Inception modules | 22 |
| ResNet | 2015 | Skip connections | 50-152 |
| DenseNet | 2017 | Dense connections | 121-264 |
| EfficientNet | 2019 | Compound scaling | variable |

VGG-16 Implementation

class VGG16(nn.Module):
    """Simplified VGG-16 architecture."""
    
    def __init__(self, num_classes=1000):
        super().__init__()
        
        self.features = nn.Sequential(
            # Block 1
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            # Block 2
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            # Block 3
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            # Block 4
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            
            # Block 5
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
        )
        
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )
    
    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x
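It is instructive to count where the parameters of this architecture live. The sketch below mirrors the layer sizes in the class above as plain lists, assuming 3×3 kernels with bias and `num_classes=1000`. It does not instantiate the model.

```python
# (in_channels, out_channels) for the 13 conv layers, all 3x3 with bias
conv_channels = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]
# (in_features, out_features) for the 3 fully connected layers
fc_sizes = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]

conv_params = sum((3 * 3 * c_in + 1) * c_out for c_in, c_out in conv_channels)
fc_params = sum((n_in + 1) * n_out for n_in, n_out in fc_sizes)
total_params = conv_params + fc_params

print(f"Conv layers: {conv_params:,}")    # 14,714,688
print(f"FC layers:   {fc_params:,}")      # 123,642,856
print(f"Total:       {total_params:,}")   # 138,357,544
print(f"FC share:    {100 * fc_params / total_params:.1f}%")
```

Roughly 89% of the parameters sit in the three fully connected layers, most of them in the first one that consumes the flattened 512×7×7 feature map.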

Exercises

Implement and apply these classic filters:
  1. Gaussian blur (5×5)
  2. Laplacian edge detector
  3. Custom “cross” pattern detector
Apply them to real images and visualize results.
Write a function that computes output dimensions for any sequence of conv and pool layers:
def compute_output_size(input_size, layers):
    """
    layers: list of dicts with keys:
        'type': 'conv' or 'pool'
        'kernel': kernel size
        'stride': stride
        'padding': padding
    """
    pass
Build a CNN for CIFAR-10 (32×32 color images, 10 classes):
  1. Design architecture to achieve >85% accuracy
  2. Use batch normalization and dropout
  3. Visualize learned filters and feature maps
  4. Analyze which classes are confused
Implement depthwise separable convolutions (used in MobileNet):
  1. Depthwise: one filter per input channel
  2. Pointwise: 1×1 convolution to mix channels
Compare parameters and speed to standard convolutions.

Key Takeaways

| Concept | Key Insight |
| --- | --- |
| Convolution | Local patterns, shared weights |
| Pooling | Downsample, add invariance |
| Stride | Skip pixels, reduce dimensions |
| Padding | Control output size |
| Feature hierarchy | Edges → Shapes → Parts → Objects |
| Receptive field | What input region affects output |

What’s Next

Module 7: Pooling, Stride & CNN Design

Build modern CNN architectures — VGG, ResNet, EfficientNet design principles.

Interview Deep-Dive

Strong Answer:
  • Weight sharing means the same convolutional filter is applied at every spatial position. This encodes the inductive bias of translation equivariance: a feature detector that recognizes a cat ear in the top-left should recognize it in the bottom-right. Without weight sharing, the network would need to independently learn the same pattern for every possible location, requiring far more parameters and data.
  • Local connectivity means each neuron only connects to a small spatial patch (the receptive field), not the entire image. This encodes the locality bias: nearby pixels are more related than distant ones. An edge at position $(10, 10)$ depends on pixels at $(9, 9)$ through $(11, 11)$, not on a pixel at $(200, 200)$.
  • Together, these biases reduce the parameter count by orders of magnitude: a 3x3 convolution with 32 filters on a 224x224 image uses 896 parameters instead of 150 million for a fully connected layer. This massive reduction acts as strong regularization, preventing overfitting.
  • When they fail: (1) Tasks requiring global context, like counting objects across the entire image — local receptive fields miss long-range dependencies. (2) Data where the same pattern at different locations has different meanings — medical images where a lesion on the left lung has different significance than on the right. (3) Non-grid-structured data — graphs, point clouds, or sets of variable-length items. This is precisely why Vision Transformers, which have global attention without locality bias, can outperform CNNs when sufficient data is available to learn spatial relationships from scratch.
Follow-up: ViTs learn to attend to any position. Does this mean locality bias is unnecessary?
Not unnecessary, just learnable at a cost. ViTs need much more data (14M+ images for ViT-B) to learn the spatial relationships that CNNs encode for free. With sufficient data, ViTs discover locality patterns similar to CNNs in early layers (attention maps show local attention) and global patterns in later layers. Hybrid architectures like Swin Transformer reintroduce locality through windowed attention, achieving CNN-like data efficiency while maintaining the global reasoning capability for later layers. The trend is toward architectures that use locality bias where it helps (early layers) and global attention where it helps (later layers).
Strong Answer:
  • The receptive field is the region of the input image that can influence a particular neuron’s output. A neuron in the first 3x3 conv layer has a 3x3 receptive field. After stacking two 3x3 conv layers, the effective receptive field is 5x5. After three, it is 7x7.
  • Why it is critical: the receptive field determines what scale of features the network can detect. To recognize a face (roughly 100x100 pixels in a typical image), the final convolutional features must have a receptive field of at least 100x100. If the receptive field is only 50x50, no single neuron can “see” the entire face, and the network must rely on the fully connected layers to integrate partial information.
  • Computing receptive field: for a stack of layers where layer $l$ has kernel size $k_l$ and stride $s_l$, the receptive field grows as $r_l = r_{l-1} + (k_l - 1) \cdot \prod_{i=1}^{l-1} s_i$, with $r_0 = 1$. Pooling layers with stride 2 are powerful receptive field amplifiers because every subsequent layer’s kernel covers twice as much input space.
  • VGG’s insight: two 3x3 convolutions have the same receptive field as one 5x5, but with fewer weights per input-output channel pair ($2 \times 3^2 = 18$ vs. $5^2 = 25$) and more non-linearity (two ReLU activations instead of one). Three 3x3 convolutions match a 7x7 filter ($3 \times 9 = 27$ vs. $49$) with even more savings. This is why many modern CNNs are built mostly from deeply stacked 3x3 kernels.
  • The effective receptive field (what the neuron actually “attends to”) is typically much smaller than the theoretical receptive field because weights near the center have more influence than those at the edges. This follows a Gaussian distribution, not a uniform one.
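The recursive formula above translates directly into a short forward pass over the layers. This is a sketch of the theoretical receptive field only. `receptive_field` is a name chosen here, and the layer lists are illustrative.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride), ordered from input to output."""
    r = 1      # a single input pixel
    jump = 1   # product of strides so far: input pixels per step at this depth
    for kernel_size, stride in layers:
        r += (kernel_size - 1) * jump
        jump *= stride
    return r

two_convs = receptive_field([(3, 1), (3, 1)])
three_convs = receptive_field([(3, 1)] * 3)
vgg_like = receptive_field([(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2)])

print(two_convs, three_convs, vgg_like)  # 5 7 16
```

In the last stack, the two convolutions after the first pool each add 4 input pixels instead of 2, which is the "amplifier" effect of stride-2 pooling.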
Follow-up: How do dilated (atrous) convolutions expand the receptive field without increasing parameters?
Dilated convolutions insert gaps between kernel elements. A 3x3 kernel with dilation 2 covers a 5x5 area (with gaps), providing a 5x5 receptive field with only 9 parameters. Stacking dilated convolutions with exponentially increasing dilation rates (1, 2, 4, 8, 16) creates very large receptive fields without pooling, preserving spatial resolution. This is essential for tasks like semantic segmentation where you need both global context (large receptive field) and pixel-level precision (no downsampling). The trade-off is that dilated convolutions can create “gridding artifacts” where the gaps in the kernel cause aliasing. DeepLab addresses this with ASPP (Atrous Spatial Pyramid Pooling), which uses multiple dilation rates in parallel and aggregates the results.
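A small calculation makes the growth rate tangible. The helpers below are hypothetical names for this sketch and assume stride-1 layers with the same kernel size throughout.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel: d * (k - 1) + 1."""
    return dilation * (kernel_size - 1) + 1

def stacked_receptive_field(kernel_size, dilations):
    """Receptive field of stride-1 dilated convs applied one after another."""
    r = 1
    for d in dilations:
        r += effective_kernel_size(kernel_size, d) - 1
    return r

span = effective_kernel_size(3, 2)
dilated_rf = stacked_receptive_field(3, [1, 2, 4, 8, 16])
plain_rf = stacked_receptive_field(3, [1, 1, 1, 1, 1])

print(span, dilated_rf, plain_rf)  # 5 63 11
```

Five dilated 3x3 layers reach a 63-pixel receptive field, versus 11 pixels for five ordinary 3x3 layers with the same number of weights.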
Strong Answer:
  • Output spatial dimensions: $H_{out} = \lfloor (H_{in} + 2P - K) / S \rfloor + 1$, where $H_{in}$ is input height, $P$ is padding, $K$ is kernel size, $S$ is stride. Same formula for width.
  • Concrete example: input $(B, 3, 224, 224)$, Conv2d(3, 64, kernel_size=7, stride=2, padding=3):
    • $H_{out} = \lfloor (224 + 2 \times 3 - 7) / 2 \rfloor + 1 = \lfloor 223/2 \rfloor + 1 = 112$
    • Output shape: $(B, 64, 112, 112)$
  • Parameter count: $(K_H \times K_W \times C_{in} + 1) \times C_{out}$ (the +1 is for the bias per filter).
    • $(7 \times 7 \times 3 + 1) \times 64 = 148 \times 64 = 9{,}472$ parameters.
    • Without bias: $7 \times 7 \times 3 \times 64 = 9{,}408$. Many modern architectures use bias=False when followed by batch normalization, since BN’s learned shift parameter ($\beta$) makes the bias redundant.
  • Compute cost (FLOPs): roughly $2 \times K^2 \times C_{in} \times C_{out} \times H_{out} \times W_{out}$ (each multiply-accumulate counted as two operations). For our example: $2 \times 49 \times 3 \times 64 \times 112 \times 112 \approx 236\text{M}$ FLOPs. For a layer with under 10K parameters this is a lot of compute, and the cause is the large spatial dimensions.
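The numbers in this answer can be reproduced in a few lines. This is a back-of-the-envelope sketch that ignores the bias additions and assumes a square input and kernel.

```python
c_in, c_out, k, stride, padding, h_in = 3, 64, 7, 2, 3, 224

h_out = (h_in + 2 * padding - k) // stride + 1
params = (k * k * c_in + 1) * c_out
# One multiply-accumulate per weight per output position
macs = k * k * c_in * c_out * h_out * h_out
flops = 2 * macs  # count the multiply and the add separately

print(h_out)            # 112
print(f"{params:,}")    # 9,472
print(f"{flops:,}")     # 236,027,904
```

Parameters depend only on the kernel and channel sizes, while FLOPs also scale with the output area. That is why an early layer can be cheap in memory but expensive in compute.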
Follow-up: Why is it common to double the number of channels when halving spatial dimensions (e.g., 64 channels at 56x56, 128 at 28x28)?
This design principle (from VGG and adopted by ResNet) keeps the total “information capacity” roughly constant across layers. When spatial dimensions are halved by stride-2 convolution or pooling, the number of spatial positions drops by 4x ($56 \times 56 = 3136$ vs. $28 \times 28 = 784$). Doubling the channel count partially compensates, keeping the total number of activations ($C \times H \times W$) from dropping too drastically. If channels were not increased, later layers would represent the input in progressively lower-dimensional spaces, creating information bottlenecks. Conversely, increasing channels beyond 2x would make later layers disproportionately expensive. The 2x rule is a practical sweet spot between information preservation and computational efficiency.
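The trade-off can be tabulated for a ResNet-style progression. The first two stages come from the question. The 256-channel and 512-channel stages are an assumed continuation of the same pattern, and the cost is for one 3x3 convolution that keeps the channel count.

```python
stages = [(64, 56), (128, 28), (256, 14), (512, 7)]  # (channels, spatial size)

rows = []
for channels, size in stages:
    activations = channels * size * size
    # Multiply-accumulates of one 3x3 conv mapping C -> C channels at this resolution
    macs = 3 * 3 * channels * channels * size * size
    rows.append((channels, size, activations, macs))
    print(f"C={channels:3d}  {size:2d}x{size:<2d}  activations={activations:>7,}  MACs={macs:,}")
```

Activations halve at each stage (200,704, then 100,352, 50,176, and 25,088), while the cost of a same-width 3x3 convolution stays at 115,605,504 multiply-accumulates. So the 2x rule also keeps per-layer compute constant as resolution drops.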