Residual & Skip Connections

The Depth Problem

In 2015, researchers at Microsoft Research tried training networks with 100+ layers. They expected deeper to mean better: after all, a 56-layer network has strictly more capacity than a 20-layer one, so it should be able to represent everything the shallower network can, plus more.

What actually happened: deeper networks performed worse than shallower ones. The 56-layer network had higher training error than the 20-layer version. This was not overfitting, because even the training error was higher. The network had the capacity but could not learn to use it. This is the degradation problem, and it puzzled the field until Kaiming He and colleagues proposed a deceptively simple fix.

An analogy: Imagine you are giving someone directions, but every instruction passes through a chain of translators. With 5 translators, the message arrives somewhat garbled. With 50, it is unrecognizable. The original intent (gradient signal) degrades with each hop, until the earliest layers receive essentially random noise as their learning signal.
The Degradation Problem: Adding more layers can hurt performance even when the network has more capacity. This is commonly attributed to signals and gradients vanishing or exploding as they pass through many layers, which makes it progressively harder for early layers to learn meaningful features. The paradox is that a deeper network should at worst match a shallower one (by learning identity mappings for the extra layers), but standard training fails to find this solution in practice.

The Residual Insight

The solution is beautifully simple: skip connections. Instead of learning $H(x)$, learn the residual $F(x) = H(x) - x$:

$$\text{Output} = F(x) + x$$

Input (x)

    ├───────────────────────────────┐
    │                               │
    ▼                               │
┌────────┐                          │
│ Conv   │                          │
└────────┘                          │
    │                               │
    ▼                               │
┌────────┐                          │
│ Conv   │                          │  Skip Connection
└────────┘                          │
    │                               │
    ▼                               │
   (+) ◄────────────────────────────┘


Output (F(x) + x)

Why This Works

The key insight in plain English: instead of asking the network “what should the output be?”, we ask “what small change should we make to the input?” If the best thing to do is nothing (identity), the weights can simply go to zero. Learning “do nothing” is trivially easy; learning “perfectly reconstruct the input through two conv layers” is not.

Think of it like editing a document. Without residuals, each layer rewrites the entire document from scratch. With residuals, each layer just suggests tracked changes: additions and deletions. If a layer has nothing useful to contribute, it proposes zero changes, and the document passes through untouched. This is far easier to learn.
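
The “do nothing” case can be checked directly. The sketch below is a minimal illustration, not part of the original listings: it builds a conv-bn-relu-conv-bn branch and zeroes the scale of the last batch norm, so that $F(x) = 0$. The channel count, input size, and the choice to zero the batch norm scale (rather than the conv weights) are assumptions made for the demo.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Residual branch F(x): conv -> bn -> relu -> conv -> bn
channels = 8  # illustrative size, chosen for the demo
branch = nn.Sequential(
    nn.Conv2d(channels, channels, 3, padding=1),
    nn.BatchNorm2d(channels),
    nn.ReLU(),
    nn.Conv2d(channels, channels, 3, padding=1),
    nn.BatchNorm2d(channels),
)

# Zero the last BatchNorm's scale. Its shift is zero by default,
# so the branch now outputs exactly zero for any input.
nn.init.zeros_(branch[-1].weight)

x = torch.randn(2, channels, 16, 16)
out = torch.relu(branch(x) + x)

# With F(x) = 0 the block reduces to ReLU(x): the input passes through.
identity_recovered = torch.equal(out, torch.relu(x))
print(identity_recovered)
```

torchvision's ResNet constructors expose the same idea as an initialization option (`zero_init_residual=True`), so that each block starts out close to an identity mapping.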
  1. Identity is easy: If the optimal transformation is identity, the residual weights just go to zero: $F(x) = 0$ means output $= x$
  2. Gradient highway: During backpropagation, gradients flow directly through the skip connection (addition is a gradient-friendly operation: the gradient of $x + F(x)$ with respect to $x$ always includes a term of 1, preventing vanishing)
  3. Ensemble effect: A ResNet with $n$ blocks can be viewed as an implicit ensemble of $2^n$ paths of different lengths, because each block can either transform or pass through
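
The “term of 1” in the gradient highway point can be written out with the chain rule. This is a simplified derivation that ignores the final ReLU applied after the addition (an assumption made to keep the algebra short); $\mathcal{L}$ denotes the loss and $y$ the block output.

```latex
y = x + F(x)
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}\left(I + \frac{\partial F}{\partial x}\right)
  = \underbrace{\frac{\partial \mathcal{L}}{\partial y}}_{\text{skip path}}
  + \underbrace{\frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x}}_{\text{residual path}}
```

Even if the residual-path term shrinks toward zero, the skip-path term carries the upstream gradient to earlier layers unchanged.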
Training tip: The reason ResNets enabled 152-layer networks (and later 1000+ layers) is fundamentally about gradient flow. Without skip connections, gradients must pass through every weight matrix. With skip connections, they have a “shortcut highway” that preserves signal strength. Monitor gradient norms at early layers — with skip connections, they should remain within 1-2 orders of magnitude of later layers.
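
One way to try the monitoring advice above is to compare the first layer's gradient norm in a toy stack with and without skip connections. This is a rough sketch under simplifying assumptions: fully connected layers instead of convolutions, no batch norm, PyTorch's default initialization, and a dummy loss. `Stack` and `first_layer_grad_norm` are hypothetical names introduced for illustration.

```python
import torch
import torch.nn as nn

class Stack(nn.Module):
    """Toy stack of Linear+ReLU layers, with optional skip connections."""

    def __init__(self, depth=50, width=64, residual=False):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(width, width) for _ in range(depth)])
        self.residual = residual

    def forward(self, x):
        for layer in self.layers:
            out = torch.relu(layer(x))
            # With residual=True each layer adds to its input instead of replacing it
            x = x + out if self.residual else out
        return x

def first_layer_grad_norm(residual):
    torch.manual_seed(0)  # same weights and data for both variants
    model = Stack(residual=residual)
    x = torch.randn(32, 64)
    model(x).mean().backward()  # dummy scalar loss, just to get gradients
    return model.layers[0].weight.grad.norm().item()

plain_norm = first_layer_grad_norm(residual=False)
skip_norm = first_layer_grad_norm(residual=True)
print(f"plain: {plain_norm:.3e}  residual: {skip_norm:.3e}")
```

In a real training loop the same measurement can be taken after `loss.backward()` by reading `.grad.norm()` on an early and a late layer's weights and logging both.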
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block -- the fundamental building unit of ResNets.
    
    Each block learns a residual function F(x) and adds it to the input:
    output = ReLU(F(x) + x)
    
    If the block has nothing useful to learn, F(x) collapses to zero
    and the input passes through unchanged.
    """
    
    def __init__(self, channels):
        super().__init__()
        # Two 3x3 convolutions with batch norm -- the 'F(x)' branch
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
    
    def forward(self, x):
        identity = x  # Save the input for the skip connection
        
        # Residual branch: conv -> bn -> relu -> conv -> bn
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        
        out += identity  # Skip connection! This is where the magic happens.
        out = self.relu(out)  # Final activation AFTER the addition
        
        return out


class BottleneckBlock(nn.Module):
    """Bottleneck block for deeper networks (ResNet-50+)."""
    
    expansion = 4
    
    def __init__(self, in_channels, bottleneck_channels):
        super().__init__()
        out_channels = bottleneck_channels * self.expansion
        
        # 1x1 reduce
        self.conv1 = nn.Conv2d(in_channels, bottleneck_channels, 1)
        self.bn1 = nn.BatchNorm2d(bottleneck_channels)
        
        # 3x3 convolve
        self.conv2 = nn.Conv2d(bottleneck_channels, bottleneck_channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(bottleneck_channels)
        
        # 1x1 expand
        self.conv3 = nn.Conv2d(bottleneck_channels, out_channels, 1)
        self.bn3 = nn.BatchNorm2d(out_channels)
        
        self.relu = nn.ReLU(inplace=True)
        
        # Projection shortcut if dimensions change
        self.shortcut = nn.Identity()
        if in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1),
                nn.BatchNorm2d(out_channels)
            )
    
    def forward(self, x):
        identity = self.shortcut(x)
        
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        
        out += identity
        out = self.relu(out)
        
        return out
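
The reason bottleneck blocks suit deeper networks is largely parameter arithmetic: the expensive 3x3 convolution runs on a reduced channel count. The comparison below is an illustrative sketch that counts only convolution weights (batch norm is omitted and `bias=False` is assumed) for a 256-channel input; `count_params` is a helper introduced here.

```python
import torch.nn as nn

def count_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

# Basic-style branch: two 3x3 convs at full width (256 channels)
basic = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1, bias=False),
    nn.Conv2d(256, 256, 3, padding=1, bias=False),
)

# Bottleneck-style branch: 1x1 reduce -> 3x3 -> 1x1 expand (256 -> 64 -> 64 -> 256)
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, 1, bias=False),
    nn.Conv2d(64, 64, 3, padding=1, bias=False),
    nn.Conv2d(64, 256, 1, bias=False),
)

basic_params = count_params(basic)            # 2 * (256 * 256 * 9) = 1,179,648
bottleneck_params = count_params(bottleneck)  # 16,384 + 36,864 + 16,384 = 69,632
print(basic_params, bottleneck_params)
```

At this width the bottleneck branch uses roughly 17x fewer convolution weights, which is what makes stacking many such blocks affordable.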

ResNet Architecture

class ResNet18(nn.Module):
    """Simplified ResNet-18."""
    
    def __init__(self, num_classes=1000):
        super().__init__()
        
        # Initial convolution
        self.conv1 = nn.Conv2d(3, 64, 7, stride=2, padding=3)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(3, stride=2, padding=1)
        
        # Residual blocks
        self.layer1 = self._make_layer(64, 64, 2)
        self.layer2 = self._make_layer(64, 128, 2, stride=2)
        self.layer3 = self._make_layer(128, 256, 2, stride=2)
        self.layer4 = self._make_layer(256, 512, 2, stride=2)
        
        # Classifier
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(512, num_classes)
    
    def _make_layer(self, in_channels, out_channels, num_blocks, stride=1):
        layers = []
        
        # First block may downsample
        layers.append(self._make_block(in_channels, out_channels, stride))
        
        # Remaining blocks
        for _ in range(1, num_blocks):
            layers.append(self._make_block(out_channels, out_channels))
        
        return nn.Sequential(*layers)
    
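    def _make_block(self, in_channels, out_channels, stride=1):
        # NOTE: ResidualBlockWithDownsample is not defined in this listing. It is
        # assumed to be a basic residual block that takes (in_channels, out_channels,
        # stride) and projects the shortcut when the output shape differs from the input.
        return ResidualBlockWithDownsample(in_channels, out_channels, stride)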
    
    def forward(self, x):
        x = self.maxpool(self.relu(self.bn1(self.conv1(x))))
        
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        
        x = self.avgpool(x)
        x = x.flatten(1)
        x = self.fc(x)
        
        return x
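
The ResNet18 listing refers to a `ResidualBlockWithDownsample` class that this page does not define. The following is one plausible definition, written as a sketch rather than the author's version: a basic two-conv block whose first convolution carries the stride, with a 1x1 projection on the shortcut whenever the stride or channel count changes. Using `bias=False` before batch norm is an assumption that follows common practice.

```python
import torch
import torch.nn as nn

class ResidualBlockWithDownsample(nn.Module):
    """Basic residual block with an optional projection shortcut (assumed definition)."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # The first conv applies the stride, so it is the one that downsamples
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        # Project the shortcut only when the output shape differs from the input
        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        identity = self.shortcut(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity  # shapes match because the shortcut was projected
        return self.relu(out)

# Shape check: stride 2 halves the spatial size, channels go from 64 to 128
block = ResidualBlockWithDownsample(64, 128, stride=2)
y = block(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 128, 28, 28])
```

With a definition like this in scope, `ResNet18()` can be instantiated and maps a `(N, 3, 224, 224)` batch to `(N, num_classes)` logits.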

DenseNet: Dense Connections

Where ResNet adds the input to the output (so the original information gets blended in), DenseNet takes a more aggressive approach: it concatenates the feature maps of all previous layers. Think of it as a group chat where every layer can read every message from every previous layer, not just the most recent one. This maximal information sharing means no feature is ever “forgotten.” In the block below, each layer reads the running concatenation and contributes growth_rate new channels to it:
class DenseBlock(nn.Module):
    """Dense block with growth rate k."""
    
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate, 3, padding=1)
            ))
    
    def forward(self, x):
        features = [x]
        
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        
        return torch.cat(features, dim=1)
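
The channel bookkeeping in a dense block is easy to get wrong, so it helps to work it out explicitly. The sketch below mirrors the `in_channels + i * growth_rate` expression from the block above; `dense_block_channels` is a hypothetical helper, and the sizes (64 input channels, growth rate 32, 6 layers) are chosen only for illustration.

```python
import torch

def dense_block_channels(in_channels, growth_rate, num_layers):
    """Input channels seen by each layer, plus the block's output channel count."""
    per_layer_inputs = [in_channels + i * growth_rate for i in range(num_layers)]
    out_channels = in_channels + num_layers * growth_rate
    return per_layer_inputs, out_channels

per_layer_inputs, out_channels = dense_block_channels(64, 32, 6)
print(per_layer_inputs)  # [64, 96, 128, 160, 192, 224]
print(out_channels)      # 256

# The same bookkeeping with tensors: concatenating on dim=1 grows the channel axis
features = [torch.zeros(1, 64, 8, 8)]
for _ in range(6):
    features.append(torch.zeros(1, 32, 8, 8))  # stand-in for one layer's new feature maps
concat_shape = tuple(torch.cat(features, dim=1).shape)
print(concat_shape)      # (1, 256, 8, 8)
```

The linear growth in channels is also why dense blocks are memory-hungry: every layer's input is the full concatenation of everything before it.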

U-Net: Skip Connections for Segmentation

U-Net applies skip connections differently from ResNet: instead of adding within a block, it bridges between the encoder (downsampling path) and decoder (upsampling path). The problem it solves is fundamental to segmentation: the encoder compresses spatial information into rich semantic features (it knows “this is a dog”), but loses precise spatial detail (it forgets exactly where the dog’s ear ends). Skip connections from the encoder pipe high-resolution spatial features directly to the decoder, giving it both the “what” and the “where.”

Think of it like an architect designing a building. The high-level plan (encoder output) says “put a window here,” but you need the detailed measurements from the original blueprint (encoder skip features) to cut the window to the right size.

U-Net combines encoder-decoder with skip connections for pixel-level predictions:
class UNet(nn.Module):
    """U-Net for image segmentation."""
    
    def __init__(self, in_channels=3, out_channels=1):
        super().__init__()
        
        # Encoder
        self.enc1 = self._block(in_channels, 64)
        self.enc2 = self._block(64, 128)
        self.enc3 = self._block(128, 256)
        self.enc4 = self._block(256, 512)
        
        self.pool = nn.MaxPool2d(2)
        
        # Bottleneck
        self.bottleneck = self._block(512, 1024)
        
        # Decoder
        self.up4 = nn.ConvTranspose2d(1024, 512, 2, stride=2)
        self.dec4 = self._block(1024, 512)  # 512 + 512 from skip
        
        self.up3 = nn.ConvTranspose2d(512, 256, 2, stride=2)
        self.dec3 = self._block(512, 256)
        
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = self._block(256, 128)
        
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = self._block(128, 64)
        
        self.out = nn.Conv2d(64, out_channels, 1)
    
    def _block(self, in_ch, out_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True)
        )
    
    def forward(self, x):
        # Encoder path (save for skip connections)
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        e4 = self.enc4(self.pool(e3))
        
        # Bottleneck
        b = self.bottleneck(self.pool(e4))
        
        # Decoder path (concatenate skip connections)
        d4 = self.dec4(torch.cat([self.up4(b), e4], dim=1))
        d3 = self.dec3(torch.cat([self.up3(d4), e3], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        
        return self.out(d1)
Figure: U-Net architecture
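
A practical consequence of concatenating encoder and decoder features is that their spatial sizes must match exactly. With four 2x2 poolings, that holds when the input height and width are multiples of 16. The helper below is a sketch of one common workaround, zero-padding the input and cropping the prediction afterwards; `pad_to_multiple` is a hypothetical name and `model` in the final comment stands for any U-Net instance.

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(x, multiple=16):
    """Zero-pad H and W on the bottom/right up to the next multiple.

    Returns the padded tensor and the original (H, W) for cropping later.
    """
    h, w = x.shape[-2:]
    pad_h = (-h) % multiple
    pad_w = (-w) % multiple
    # F.pad order for the last two dims: (left, right, top, bottom)
    return F.pad(x, (0, pad_w, 0, pad_h)), (h, w)

x = torch.randn(1, 3, 100, 100)
# Pooling floors the size at each step: 100 -> 50 -> 25 -> 12 -> 6.
# Upsampling 6 -> 12 -> 24 no longer matches the 25-pixel encoder map,
# so torch.cat would raise a size mismatch on an unpadded 100x100 input.
x_padded, (h, w) = pad_to_multiple(x)
print(x_padded.shape)  # torch.Size([1, 3, 112, 112])

# After the forward pass, crop the prediction back to the original size:
# pred = model(x_padded)[..., :h, :w]
```

Other options include resizing the input or cropping the encoder features before concatenation; padding is shown here only because it leaves the network code untouched.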

Comparison

| Architecture | Connection Type | Best For | Trade-off |
| --- | --- | --- | --- |
| ResNet | Add | Image classification | Simple, proven, but additive blending can lose detail |
| DenseNet | Concatenate | Feature reuse, fewer params | Better gradient flow, but memory-hungry due to concatenation |
| U-Net | Skip + Concat | Segmentation | Preserves spatial detail, but doubles decoder channel counts |
| Highway | Gated add | Sequence modeling | Learnable skip vs transform ratio, but adds parameter overhead |

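
The Highway row is the only entry in the comparison without code on this page. As a rough sketch of what a “gated add” means, the layer below computes a gate $T(x)$ and mixes the transformed and untouched input as $T(x) \cdot H(x) + (1 - T(x)) \cdot x$. The class name, the fully connected layout, and the negative gate bias (which biases the layer toward passing the input through at initialization) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """Minimal highway layer: a learned gate blends transform and skip paths."""

    def __init__(self, dim, gate_bias=-2.0):
        super().__init__()
        self.transform = nn.Linear(dim, dim)  # H(x)
        self.gate = nn.Linear(dim, dim)       # T(x), squashed to (0, 1) by a sigmoid
        # A negative bias starts the gate mostly closed, favoring the skip path
        nn.init.constant_(self.gate.bias, gate_bias)

    def forward(self, x):
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))
        return t * h + (1.0 - t) * x

layer = HighwayLayer(16)
x = torch.randn(4, 16)
y = layer(x)
print(y.shape)  # torch.Size([4, 16])
```

The extra `gate` layer is the parameter overhead mentioned in the table: a plain residual connection fixes the mix at a simple sum and needs no such parameters.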
Common pitfall: dimension mismatches. When the residual branch changes spatial dimensions (stride > 1) or channel count, you must add a projection shortcut (1x1 conv + BN, using the same stride as the branch) to the skip connection so the shapes match for addition. Forgetting this is one of the most common implementation bugs: your code will crash with a shape mismatch error at the out += identity line. The BottleneckBlock above handles the channel-count case with self.shortcut; it takes no stride argument, so its spatial size never changes.
When to use what: ResNet is the safe default for classification. If you are constrained on parameter count and want maximum feature reuse, DenseNet with a modest growth rate (k=12-32) is worth trying, keeping in mind that its concatenated activations are memory-hungry. U-Net (or its variants like U-Net++ and nnU-Net) is the go-to for any pixel-level prediction task: segmentation, depth estimation, super-resolution, or diffusion model backbones.

Exercises

Compare gradient magnitudes at early layers for a 50-layer network with and without skip connections.
Implement ResNet-50 and ResNet-101 using bottleneck blocks.
Train U-Net on a simple segmentation task (e.g., cell segmentation).

What’s Next

Module 16: Normalization Techniques

Batch norm, layer norm, and other techniques for stabilizing training.