# Convolutional Neural Networks

> The architecture that revolutionized computer vision - convolutions, filters, and feature maps

## Why Images Need Special Treatment

A 224×224 RGB image has 224 × 224 × 3 = **150,528 input values** (50,176 pixels × 3 color channels). If we connected this to a fully connected layer with 1000 neurons:

* 150,528 × 1000 ≈ **150 million parameters** in ONE layer!
* Ignores spatial structure (neighboring pixels are related)
* Overfits easily
* Computationally expensive

**CNNs solve this** by:

1. **Local connectivity**: Each neuron only sees a small patch (a cat's ear doesn't depend on what's in the bottom-right corner)
2. **Weight sharing**: Same filter applied across the entire image (an edge detector that works in the top-left should work anywhere)
3. **Translation invariance**: A cat is a cat, regardless of position. Strictly, convolution is translation *equivariant* (shift the input and the feature map shifts with it); pooling then makes the final prediction approximately invariant to position.

Think of it like proofreading a document. A fully connected network reads the entire page at once, trying to understand every letter in relation to every other letter. A CNN uses a magnifying glass that slides across the page, looking for local patterns: misspellings, grammatical errors, formatting issues. The same magnifying glass works everywhere on the page -- you don't need a separate one for each paragraph. This is why CNNs are so parameter-efficient for spatial data.

*Figure: Fully Connected vs Convolutional*
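To make the parameter savings concrete, here is a minimal sketch in plain Python that counts the weights of the fully connected layer described above and of a small convolutional layer. The choice of 32 filters of size 3×3 is an assumption made for illustration; it matches the convolutional layer used later on this page.

```python
# Parameter count: fully connected vs. convolutional layer
height, width, channels = 224, 224, 3
inputs = height * width * channels              # 150,528 input values

# Fully connected: every input connects to every one of 1000 neurons
fc_neurons = 1000
fc_params = inputs * fc_neurons + fc_neurons    # weights + one bias per neuron

# Convolutional (illustrative assumption: 32 filters of size 3x3)
k, filters = 3, 32
conv_params = (k * k * channels + 1) * filters  # +1 bias per filter

print(f"Fully connected: {fc_params:,}")
print(f"Convolutional:   {conv_params:,}")
print(f"Ratio:           {fc_params / conv_params:,.0f}x")
```

The convolutional count does not depend on the image size at all, which is the practical consequence of weight sharing.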

***

## The Convolution Operation

### Intuition: A Sliding Window Detector

Imagine sliding a small "template" (filter/kernel) across an image, computing similarity at each position:

```
Image:          Filter:       Output:
[1 2 3 4 5]     [1 0 -1]      [? ? ?]
[2 3 4 5 6]
[3 4 5 6 7]  →             →  Feature Map
[4 5 6 7 8]
[5 6 7 8 9]
```

At each position, we compute: $\sum_{i,j} \text{Image}_{i,j} \cdot \text{Filter}_{i,j}$

### Mathematical Definition

For a 2D convolution:

$$
(I * K)[i, j] = \sum_{m}\sum_{n} I[i+m, j+n] \cdot K[m, n]
$$

Strictly speaking, this formula (no kernel flip) is cross-correlation; deep learning frameworks implement it and call it convolution. Because the kernel weights are learned, the distinction does not matter in practice.

```python
import numpy as np

def conv2d_naive(image, kernel, stride=1, padding=0):
    """
    Simple 2D convolution implementation.

    Args:
        image: Input image (H, W)
        kernel: Convolution kernel (kH, kW)
        stride: Step size
        padding: Zero padding around image
    """
    # Add padding
    if padding > 0:
        image = np.pad(image, padding, mode='constant')

    H, W = image.shape
    kH, kW = kernel.shape

    # Output dimensions
    out_H = (H - kH) // stride + 1
    out_W = (W - kW) // stride + 1

    output = np.zeros((out_H, out_W))

    for i in range(out_H):
        for j in range(out_W):
            # Extract patch
            patch = image[i*stride:i*stride+kH, j*stride:j*stride+kW]
            # Compute dot product
            output[i, j] = np.sum(patch * kernel)

    return output

# Example: Edge detection
image = np.array([
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
])

# Vertical edge detector
edge_kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1]
])

result = conv2d_naive(image, edge_kernel)
print("Input shape:", image.shape)
print("Output shape:", result.shape)
print("Output:\n", result)
```

***

## Common Filter Types

### Edge Detection

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import ndimage
from skimage import data

# Sobel filters for edges
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

sobel_y = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]])

# Load sample image. data.camera() is already grayscale (2D, uint8),
# so scale it to floats in [0, 1] rather than calling rgb2gray on it.
image = data.camera().astype(np.float64) / 255.0

# Apply filters. ndimage.convolve flips the kernel (true convolution);
# the resulting sign change is hidden by the abs() and squaring below.
edges_x = ndimage.convolve(image, sobel_x)
edges_y = ndimage.convolve(image, sobel_y)
edges_magnitude = np.sqrt(edges_x**2 + edges_y**2)

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
axes[0].imshow(image, cmap='gray')
axes[0].set_title('Original')
axes[1].imshow(np.abs(edges_x), cmap='gray')
axes[1].set_title('Vertical Edges')
axes[2].imshow(np.abs(edges_y), cmap='gray')
axes[2].set_title('Horizontal Edges')
axes[3].imshow(edges_magnitude, cmap='gray')
axes[3].set_title('Edge Magnitude')
plt.show()
```

*Figure: Edge Detection Filters*
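One detail worth checking by hand: the formula in the Mathematical Definition section slides the kernel without flipping it, whereas `scipy.ndimage.convolve` flips the kernel first. The sketch below uses plain Python and two hypothetical helpers, `correlate2d_valid` and `flip_kernel`, to show that for an antisymmetric kernel such as Sobel the two results differ only in sign, which is why taking the absolute value in the plots hides the difference.

```python
def correlate2d_valid(image, kernel):
    """Slide the kernel over the image without flipping it (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n]
                for m in range(kh) for n in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def flip_kernel(kernel):
    """Rotate the kernel by 180 degrees (reverse the rows and the columns)."""
    return [row[::-1] for row in kernel[::-1]]

# A dark-to-bright vertical edge
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
sobel_x = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

cross_corr = correlate2d_valid(image, sobel_x)             # what CNN layers compute
true_conv = correlate2d_valid(image, flip_kernel(sobel_x))  # kernel flipped first

print("Cross-correlation:", cross_corr)
print("True convolution: ", true_conv)
```

For a symmetric kernel (for example a Gaussian blur) the flip changes nothing, so the two operations coincide.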

### Blur and Sharpen

```python
import numpy as np

# Gaussian blur (smoothing): 3x3 approximation, normalized to sum to 1
gaussian = np.array([[1, 2, 1],
                     [2, 4, 2],
                     [1, 2, 1]]) / 16

# Sharpening
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])

# Emboss
emboss = np.array([[-2, -1, 0],
                   [-1,  1, 1],
                   [ 0,  1, 2]])
```

***

## CNN Building Blocks

### Convolutional Layer

```python
import torch
import torch.nn as nn

# Convolutional layer -- the core building block of every CNN
conv = nn.Conv2d(
    in_channels=3,    # RGB input (3 color channels)
    out_channels=32,  # 32 different filters, each learning a different pattern
    kernel_size=3,    # 3x3 filters -- small enough for local patterns, large enough for edges
    stride=1,         # Move 1 pixel at a time (no skipping)
    padding=1         # Add 1 pixel of zeros around the border to preserve spatial dimensions
    # Why padding=1 with kernel=3? Output = (input - 3 + 2*1)/1 + 1 = input. Dimensions preserved.
)

# Input: (batch, channels, height, width)
x = torch.randn(1, 3, 224, 224)
output = conv(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Number of parameters: {sum(p.numel() for p in conv.parameters()):,}")
```

**Parameter count**: $(k_H \times k_W \times C_{in} + 1) \times C_{out}$

* $3 \times 3 \times 3 + 1 = 28$ parameters per filter
* $28 \times 32 = 896$ total parameters

Compare to a fully connected layer with just 32 output neurons: $224 \times 224 \times 3 \times 32 \approx 4.8$ million weights!

### Pooling Layers

Pooling reduces spatial dimensions while keeping important features. Think of it as summarizing: instead of reporting every detail, you report the highlights.

```python
import torch
import torch.nn as nn

# Max pooling - takes maximum in each 2x2 window
# Intuition: "Is this feature present ANYWHERE in this region?"
# A strong edge activation in any of the 4 pixels survives; the rest are discarded.
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)

# Average pooling - takes average in each window
# Intuition: "How strongly is this feature present ON AVERAGE in this region?"
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

# Global average pooling - reduces entire feature map to a single value per channel
# Many modern CNNs use this in place of large fully connected layers before the classifier
global_avg_pool = nn.AdaptiveAvgPool2d(1)

x = torch.randn(1, 32, 224, 224)
print(f"Input: {x.shape}")
print(f"After 2×2 max pool: {max_pool(x).shape}")
print(f"After 2×2 avg pool: {avg_pool(x).shape}")
print(f"After global avg pool: {global_avg_pool(x).shape}")
```
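To see what the two pooling modes do to actual numbers, here is a small worked sketch in plain Python. The helper `pool2d` and the 4×4 feature map are made up for this illustration; they are not part of PyTorch.

```python
def pool2d(feature_map, size=2, stride=2, reducer=max):
    """Slide a size x size window over a 2D list and reduce each window to one value."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [
                feature_map[i * stride + m][j * stride + n]
                for m in range(size) for n in range(size)
            ]
            row.append(reducer(window))
        output.append(row)
    return output

def mean(values):
    return sum(values) / len(values)

# Hypothetical 4x4 feature map
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 1, 7, 8],
]

max_pooled = pool2d(feature_map, reducer=max)   # strongest activation per window
avg_pooled = pool2d(feature_map, reducer=mean)  # average activation per window

print("Max pool:", max_pooled)
print("Avg pool:", avg_pooled)
```

Max pooling keeps the single strongest response in each window, while average pooling is pulled down by the weak responses around it.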

***

## Building a Complete CNN

### LeNet-5 Style Architecture

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Simple CNN for MNIST digit classification."""

    def __init__(self):
        super().__init__()

        # Convolutional layers -- each learns increasingly abstract features
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)   # 1 input channel (grayscale), 32 filters
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)  # 32->64: doubling channels is a common pattern
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, padding=1)  # 64->64: same channels, more depth

        # Pooling -- halves spatial dimensions each time
        self.pool = nn.MaxPool2d(2, 2)

        # Fully connected layers -- map spatial features to class predictions
        # 64 channels * 3*3 spatial = 576 features after three pooling operations
        self.fc1 = nn.Linear(64 * 3 * 3, 64)
        self.fc2 = nn.Linear(64, 10)  # 10 output classes (digits 0-9)

        # Dropout: randomly zero 50% of FC neurons during training
        # This forces the network to learn redundant representations -- no single neuron can be a bottleneck
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        # Conv block 1: 28x28 -> 14x14 (learns edges and simple strokes)
        x = self.pool(torch.relu(self.conv1(x)))

        # Conv block 2: 14x14 -> 7x7 (learns combinations of edges: curves, corners)
        x = self.pool(torch.relu(self.conv2(x)))

        # Conv block 3: 7x7 -> 3x3 (learns digit-specific parts: loops, lines)
        x = self.pool(torch.relu(self.conv3(x)))

        # Flatten: collapse spatial dimensions into a single vector
        x = x.view(x.size(0), -1)  # (batch, 64, 3, 3) -> (batch, 576)

        # Fully connected classifier
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)  # Only active during training (model.train())
        x = self.fc2(x)      # Raw logits -- no softmax here

        return x

model = SimpleCNN()
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")
```

### Training the CNN

```python
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Data loading
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean and std
])

train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_data = datasets.MNIST('./data', train=False, transform=transform)

train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=1000)

# Training setup
# Move model to GPU if available -- CNNs benefit enormously from GPU parallelism
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()  # Expects raw logits, applies log-softmax internally
optimizer = optim.Adam(model.parameters(), lr=0.001)  # 0.001 is Adam's default and a common starting point

# Training loop -- identical pattern to MLPs, just with images instead of flat vectors
def train_epoch(model, loader, criterion, optimizer, device):
    model.train()  # Enable dropout (and batch norm training mode, if present)
    total_loss = 0
    correct = 0

    for data, target in loader:
        data, target = data.to(device), target.to(device)  # Move data to same device as model

        optimizer.zero_grad()             # Clear accumulated gradients
        output = model(data)              # Forward: image -> class logits
        loss = criterion(output, target)  # Compute cross-entropy loss
        loss.backward()                   # Backward: compute gradients
        optimizer.step()                  # Update weights

        total_loss += loss.item()
        pred = output.argmax(dim=1)       # Predicted class = highest logit
        correct += pred.eq(target).sum().item()

    # Mean loss per batch, accuracy in percent
    return total_loss / len(loader), 100. * correct / len(loader.dataset)

# Train for a few epochs
for epoch in range(5):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    print(f"Epoch {epoch+1}: Loss = {train_loss:.4f}, Accuracy = {train_acc:.2f}%")
```

***

## Visualizing What CNNs Learn

### Filter Visualization

```python
import matplotlib.pyplot as plt

def visualize_filters(model):
    """Visualize the learned filters in the first conv layer."""
    # Get first conv layer weights: shape (32, 1, 3, 3)
    weights = model.conv1.weight.data.cpu()

    # Normalize to [0, 1] for visualization
    weights = (weights - weights.min()) / (weights.max() - weights.min())

    # Plot one 3x3 filter per subplot (4 x 8 grid = 32 filters)
    fig, axes = plt.subplots(4, 8, figsize=(12, 6))
    for i, ax in enumerate(axes.flat):
        if i < weights.shape[0]:
            ax.imshow(weights[i, 0].numpy(), cmap='gray')
        ax.axis('off')

    plt.suptitle('Learned Filters (First Conv Layer)')
    plt.show()

visualize_filters(model)
```

### Feature Map Visualization

```python
def visualize_feature_maps(model, image):
    """Visualize feature maps at different layers."""
    activations = []

    def hook_fn(module, input, output):
        activations.append(output.detach())

    # Register hooks
    hooks = [
        model.conv1.register_forward_hook(hook_fn),
        model.conv2.register_forward_hook(hook_fn),
        model.conv3.register_forward_hook(hook_fn),
    ]

    # Forward pass in eval mode (dropout disabled)
    model.eval()
    with torch.no_grad():
        model(image.unsqueeze(0))

    # Remove hooks
    for hook in hooks:
        hook.remove()

    # Plot the first 8 feature maps of each layer
    fig, axes = plt.subplots(3, 8, figsize=(16, 6))

    for layer_idx, acts in enumerate(activations):
        for i in range(8):
            ax = axes[layer_idx, i]
            ax.imshow(acts[0, i].cpu().numpy(), cmap='viridis')
            # Hide ticks only; axis('off') would also hide the row labels below
            ax.set_xticks([])
            ax.set_yticks([])

    axes[0, 0].set_ylabel('Conv1', rotation=90, fontsize=12)
    axes[1, 0].set_ylabel('Conv2', rotation=90, fontsize=12)
    axes[2, 0].set_ylabel('Conv3', rotation=90, fontsize=12)

    plt.suptitle('Feature Maps at Different Layers')
    plt.show()

# Visualize for a sample image
sample_image = test_data[0][0]
visualize_feature_maps(model, sample_image.to(device))
```

*Figure: Feature Maps Visualization*
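As a cross-check on the `Total parameters` figure printed for `SimpleCNN` earlier, the count can be reproduced by hand from the layer shapes. This is a sketch in plain Python; the helper names `conv_params` and `linear_params` are made up for this example, and it assumes every layer keeps its default bias term.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per filter for a square-kernel Conv2d."""
    return (kernel_size * kernel_size * in_channels + 1) * out_channels

def linear_params(in_features, out_features):
    """Weights plus one bias per output unit for a Linear layer."""
    return (in_features + 1) * out_features

# Layer shapes copied from the SimpleCNN definition
layer_counts = {
    "conv1": conv_params(1, 32, 3),
    "conv2": conv_params(32, 64, 3),
    "conv3": conv_params(64, 64, 3),
    "fc1": linear_params(64 * 3 * 3, 64),
    "fc2": linear_params(64, 10),
}
total = sum(layer_counts.values())

for name, count in layer_counts.items():
    print(f"{name}: {count:,}")
print(f"Total: {total:,}")
```

Pooling and dropout contribute nothing here because they have no learnable parameters.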

***

## Key CNN Concepts

### Stride and Padding

| Concept     | Effect                                     | Formula                                                                 |
| ----------- | ------------------------------------------ | ----------------------------------------------------------------------- |
| **Stride**  | How many pixels the filter moves per step  | Output = ⌊(Input - Kernel + 2×Padding) / Stride⌋ + 1                    |
| **Padding** | Zeros added around input                   | Keeps spatial dimensions with `padding=kernel//2` (odd kernel, stride 1) |
| **Valid**   | No padding                                 | Output shrinks                                                          |
| **Same**    | Pad to maintain size                       | Output = Input (when stride=1)                                          |

```python
import torch
import torch.nn as nn

# Different stride and padding effects
x = torch.randn(1, 1, 32, 32)

conv_valid = nn.Conv2d(1, 1, kernel_size=5, padding=0)             # 32→28
conv_same = nn.Conv2d(1, 1, kernel_size=5, padding=2)              # 32→32
conv_stride = nn.Conv2d(1, 1, kernel_size=5, padding=2, stride=2)  # 32→16

print(f"Valid: {x.shape} → {conv_valid(x).shape}")
print(f"Same: {x.shape} → {conv_same(x).shape}")
print(f"Stride 2: {x.shape} → {conv_stride(x).shape}")
```

### Receptive Field

The **receptive field** is the region of input that affects a particular output neuron. This is one of the most important concepts in CNN design.

Think of it as "how much of the original image can this neuron see?" A neuron in the first conv layer sees a 3x3 patch. A neuron two layers deep effectively sees a 5x5 patch (because it combines outputs from overlapping 3x3 patches). A neuron at the end of a deep CNN might "see" the entire image. If your receptive field is too small for the patterns you need to detect (say, you need to recognize a full face but your receptive field only covers an eye), the network will struggle -- no single neuron can integrate enough context.

```python
def compute_receptive_field(layers):
    """
    Compute receptive field for a stack of conv/pool layers.

    Args:
        layers: List of (kernel_size, stride) tuples, ordered input to output
    """
    r = 1  # Start with a single output unit (1×1)

    # Walk back from the output: r outputs spaced `stride` apart,
    # each covering `kernel_size` inputs, span stride*(r-1) + kernel_size inputs.
    for kernel_size, stride in reversed(layers):
        r = stride * (r - 1) + kernel_size

    return r

# Example: VGG-style stack
layers = [
    (3, 1),  # Conv 3×3, stride 1
    (3, 1),  # Conv 3×3, stride 1
    (2, 2),  # Pool 2×2, stride 2
    (3, 1),  # Conv 3×3, stride 1
    (3, 1),  # Conv 3×3, stride 1
    (2, 2),  # Pool 2×2, stride 2
]

rf = compute_receptive_field(layers)
print(f"Receptive field: {rf}×{rf}")  # 16×16 for this stack
```

***

## Classic CNN Architectures

| Architecture     | Year | Key Innovation                      | Depth    |
| ---------------- | ---- | ----------------------------------- | -------- |
| **LeNet-5**      | 1998 | Early successful CNN (digit recognition) | 5   |
| **AlexNet**      | 2012 | ReLU, Dropout, GPU training         | 8        |
| **VGGNet**       | 2014 | Small 3×3 filters                   | 16-19    |
| **GoogLeNet**    | 2014 | Inception modules                   | 22       |
| **ResNet**       | 2015 | Skip connections                    | 18-152   |
| **DenseNet**     | 2017 | Dense connections                   | 121-264  |
| **EfficientNet** | 2019 | Compound scaling                    | variable |

### VGG-16 Implementation

```python
import torch.nn as nn

class VGG16(nn.Module):
    """Simplified VGG-16 architecture (expects 224×224 RGB input)."""

    def __init__(self, num_classes=1000):
        super().__init__()

        self.features = nn.Sequential(
            # Block 1: 224 -> 112
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),

            # Block 2: 112 -> 56
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),

            # Block 3: 56 -> 28
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),

            # Block 4: 28 -> 14
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),

            # Block 5: 14 -> 7
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
        )

        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),  # 7×7 spatial size assumes 224×224 input
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)  # (batch, 512, 7, 7) -> (batch, 25088)
        x = self.classifier(x)
        return x
```

***

## Exercises

### Exercise 1

Implement and apply these classic filters:

1. Gaussian blur (5×5)
2. Laplacian edge detector
3. Custom "cross" pattern detector

Apply them to real images and visualize results.

### Exercise 2

Write a function that computes output dimensions for any sequence of conv and pool layers:

```python
def compute_output_size(input_size, layers):
    """
    layers: list of dicts with keys:
        'type': 'conv' or 'pool'
        'kernel': kernel size
        'stride': stride
        'padding': padding
    """
    pass
```

### Exercise 3

Build a CNN for CIFAR-10 (32×32 color images, 10 classes):

1. Design architecture to achieve >85% accuracy
2. Use batch normalization and dropout
3. Visualize learned filters and feature maps
4. Analyze which classes are confused

### Exercise 4

Implement depthwise separable convolutions (used in MobileNet):

1. Depthwise: one filter per input channel
2. Pointwise: 1×1 convolution to mix channels

Compare parameters and speed to standard convolutions.

***

## Key Takeaways

| Concept               | Key Insight                      |
| --------------------- | -------------------------------- |
| **Convolution**       | Local patterns, shared weights   |
| **Pooling**           | Downsample, add invariance       |
| **Stride**            | Skip pixels, reduce dimensions   |
| **Padding**           | Control output size              |
| **Feature hierarchy** | Edges → Shapes → Parts → Objects |
| **Receptive field**   | What input region affects output |

***

## What's Next

Build modern CNN architectures: VGG, ResNet, EfficientNet design principles.

***

## Interview Deep-Dive

**Strong Answer:**

* **Weight sharing** means the same convolutional filter is applied at every spatial position. This encodes the inductive bias of translation equivariance: a feature detector that recognizes a cat ear in the top-left should recognize it in the bottom-right.
  Without weight sharing, the network would need to independently learn the same pattern for every possible location, requiring far more parameters and data.
* **Local connectivity** means each neuron only connects to a small spatial patch (the receptive field), not the entire image. This encodes the locality bias: nearby pixels are more related than distant ones. An edge at position $(10, 10)$ depends on pixels at $(9,9)$ through $(11,11)$, not on a pixel at $(200, 200)$.
* Together, these biases reduce the parameter count by orders of magnitude: a 3x3 convolution with 32 filters on a 224x224 RGB image uses 896 parameters, compared with about 150 million for a fully connected layer with 1000 neurons. This massive reduction acts as strong regularization and helps prevent overfitting.
* **When they fail**: (1) Tasks requiring global context, like counting objects across the entire image -- local receptive fields miss long-range dependencies. (2) Data where the same pattern at different locations has different meanings -- medical images where a lesion on the left lung has different significance than on the right. (3) Non-grid-structured data -- graphs, point clouds, or sets of variable-length items. This is one reason Vision Transformers, which have global attention without a built-in locality bias, can outperform CNNs when sufficient data is available to learn spatial relationships from scratch.

**Follow-up: ViTs learn to attend to any position -- does this mean locality bias is unnecessary?**

Not unnecessary, just learnable at a cost. ViTs need much more data (on the order of 14M+ images for ViT-B) to learn the spatial relationships that CNNs encode for free. With sufficient data, ViTs discover locality patterns similar to CNNs in early layers (attention maps show local attention) and global patterns in later layers. Hybrid architectures like Swin Transformer reintroduce locality through windowed attention, achieving CNN-like data efficiency while maintaining the global reasoning capability for later layers.
The trend is toward architectures that use locality bias where it helps (early layers) and global attention where it helps (later layers).

**Strong Answer:**

* The receptive field is the region of the input image that can influence a particular neuron's output. A neuron in the first 3x3 conv layer has a 3x3 receptive field. After stacking two 3x3 conv layers (stride 1), the receptive field is 5x5. After three, it is 7x7.
* **Why it is critical**: the receptive field determines what scale of features the network can detect. To recognize a face (roughly 100x100 pixels in a typical image), the final convolutional features must have a receptive field of at least 100x100. If the receptive field is only 50x50, no single neuron can "see" the entire face, and the network must rely on the fully connected layers to integrate partial information.
* **Computing receptive field**: for a stack of layers where layer $l$ has kernel size $k_l$ and stride $s_l$, the receptive field grows as $r_{l} = r_{l-1} + (k_l - 1) \cdot \prod_{i=1}^{l-1} s_i$, starting from $r_0 = 1$. Pooling layers with stride 2 are powerful receptive field amplifiers because every subsequent layer's kernel covers twice as much input space.
* **VGG's insight**: two 3x3 convolutions have the same receptive field as one 5x5, but with fewer parameters per input-output channel pair ($2 \times 3^2 = 18$ vs. $5^2 = 25$) and more non-linearity (two ReLU activations instead of one). Three 3x3 convolutions match a 7x7 filter ($3 \times 9 = 27$ vs. $49$) with even more savings. This is why many modern CNNs rely mostly on deep stacks of 3x3 kernels, with larger kernels reserved for special cases such as the first layer.
* The **effective receptive field** (what the neuron actually "attends to") is typically much smaller than the theoretical receptive field because input pixels near the center have more influence than those at the edges. The influence falls off in a roughly Gaussian pattern rather than uniformly.

**Follow-up: How do dilated (atrous) convolutions expand the receptive field without increasing parameters?**

Dilated convolutions insert gaps between kernel elements.
A 3x3 kernel with dilation 2 covers a 5x5 area (with gaps), providing a 5x5 receptive field with only 9 parameters. Stacking dilated convolutions with exponentially increasing dilation rates (1, 2, 4, 8, 16) creates very large receptive fields without pooling -- preserving spatial resolution. This is essential for tasks like semantic segmentation where you need both global context (large receptive field) and pixel-level precision (no downsampling). The trade-off is that dilated convolutions can create "gridding artifacts" where the gaps in the kernel cause aliasing. One mitigation is to mix dilation rates: DeepLab's ASPP (Atrous Spatial Pyramid Pooling) applies several dilation rates in parallel and aggregates the results.

**Strong Answer:**

* **Output spatial dimensions**: $H_{out} = \lfloor (H_{in} + 2P - K) / S \rfloor + 1$, where $H_{in}$ is input height, $P$ is padding, $K$ is kernel size, $S$ is stride. Same formula for width.
* **Concrete example**: input $(B, 3, 224, 224)$, `Conv2d(3, 64, kernel_size=7, stride=2, padding=3)`:
  * $H_{out} = \lfloor (224 + 2 \times 3 - 7) / 2 \rfloor + 1 = \lfloor 223/2 \rfloor + 1 = 112$
  * Output shape: $(B, 64, 112, 112)$
* **Parameter count**: $(K_H \times K_W \times C_{in} + 1) \times C_{out}$ (the +1 is for the bias per filter).
  * $(7 \times 7 \times 3 + 1) \times 64 = 148 \times 64 = 9,\!472$ parameters.
  * Without bias: $7 \times 7 \times 3 \times 64 = 9,\!408$. Many modern architectures use `bias=False` when the convolution is followed by batch normalization, since BN's learned shift parameter ($\beta$) makes the bias redundant.
* **Compute cost**: roughly $K^2 \times C_{in} \times C_{out} \times H_{out} \times W_{out}$ multiply-accumulate operations (MACs) per image, or twice that in FLOPs when each multiply and each add is counted separately. For our example: $49 \times 3 \times 64 \times 112 \times 112 \approx 118M$ MACs $\approx 236M$ FLOPs. That is typically a few percent of a ResNet's total compute from a layer with fewer than 10,000 parameters, because of the large spatial dimensions.
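The arithmetic in this answer can be checked with a few lines of plain Python. The function name `conv2d_stats` is hypothetical, and the sketch assumes a square kernel, equal padding on all sides, and a batch of one.

```python
def conv2d_stats(h_in, w_in, c_in, c_out, kernel, stride=1, padding=0, bias=True):
    """Output size, parameter count, and compute cost of one conv layer."""
    h_out = (h_in + 2 * padding - kernel) // stride + 1
    w_out = (w_in + 2 * padding - kernel) // stride + 1
    params = (kernel * kernel * c_in + (1 if bias else 0)) * c_out
    macs = kernel * kernel * c_in * c_out * h_out * w_out  # one MAC per weight per output position
    return {
        "output_shape": (c_out, h_out, w_out),
        "params": params,
        "macs": macs,
        "flops": 2 * macs,  # counting the multiply and the add separately
    }

# The 7x7, stride-2 first layer from the example above
stats = conv2d_stats(224, 224, c_in=3, c_out=64, kernel=7, stride=2, padding=3)
stats_no_bias = conv2d_stats(224, 224, c_in=3, c_out=64, kernel=7, stride=2,
                             padding=3, bias=False)

for key, value in stats.items():
    print(f"{key}: {value}")
print(f"params without bias: {stats_no_bias['params']}")
```

The bias additions are left out of the compute estimate because they are negligible next to the kernel multiplications.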

**Follow-up: Why is it common to double the number of channels when halving spatial dimensions (e.g., 64 channels at 56x56, 128 at 28x28)?**

This design principle (from VGG and adopted by ResNet) keeps the representation from shrinking too quickly while holding per-layer compute roughly constant. When spatial dimensions are halved by stride-2 convolution or pooling, the number of spatial positions drops by 4x ($56 \times 56 = 3136$ vs. $28 \times 28 = 784$). Doubling the channel count partially compensates, so the total number of activations ($C \times H \times W$) drops by only 2x instead of 4x. The cost of a convolution scales with $C_{in} \times C_{out} \times H \times W$, so doubling both channel counts (4x) while quartering the positions leaves the compute per layer unchanged. If channels were not increased, later layers would represent the input in progressively lower-dimensional spaces, creating information bottlenecks. Conversely, increasing channels beyond 2x would make later layers disproportionately expensive. The 2x rule is a practical sweet spot between information preservation and computational efficiency.
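The trade-off can be tabulated directly. The sketch below takes the two stages named in the question (64 channels at 56x56, 128 at 28x28) and, as an assumption, extends the same rule to two further stages (256 at 14x14, 512 at 7x7); it also assumes each stage uses 3x3 convolutions with equal input and output channel counts.

```python
# (channels, spatial size) per stage; the last two entries are an assumed
# continuation of the doubling/halving rule
stages = [(64, 56), (128, 28), (256, 14), (512, 7)]

activations = []
macs_per_conv = []
for channels, size in stages:
    activations.append(channels * size * size)                        # C x H x W
    macs_per_conv.append(3 * 3 * channels * channels * size * size)   # 3x3 conv, C_in = C_out

for (channels, size), acts, macs in zip(stages, activations, macs_per_conv):
    print(f"{channels:>3} ch @ {size:>2}x{size:<2}  activations={acts:>7,}  MACs per 3x3 conv={macs:,}")
```

Activations halve at every stage while the cost of a 3x3 convolution stays the same, which is the balance the 2x rule is meant to strike.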