Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Object Detection
The Detection Problem
Image classification answers “what is in this image?” Object detection answers the much harder question: “what objects are in this image, and where exactly is each one?” This combines two sub-problems into one: Object detection = classification + localization For each object in an image, we need:- What: Class label (car, person, dog…)
- Where: Bounding box (x,y,w,h) specifying the tight rectangle around the object
When choosing a detection framework, the single most important question is: what is your latency budget? If you need real-time (under 30ms), start with YOLO or a single-stage anchor-free method. If accuracy matters more than speed (e.g., medical imaging), Faster R-CNN or DETR with a strong backbone is worth the extra latency. For a production-ready starting point, the torchvision or ultralytics YOLO libraries get you to a working system in under 50 lines of code.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from typing import List, Tuple, Dict, Optional
torch.manual_seed(42)
Detection Paradigms
Two-Stage Detectors
The two-stage approach mirrors how humans scan a scene: first, quickly identify regions that might contain objects (“there is something interesting in the top-left corner”), then look closely at each candidate to determine what it is and refine its location. This “coarse then fine” strategy is slower but typically more accurate than single-stage methods. Generate proposals (stage 1) then classify and refine (stage 2):class RegionProposalNetwork(nn.Module):
"""
Region Proposal Network (RPN) from Faster R-CNN.
Generates object proposals from feature maps. The key innovation of
Faster R-CNN: instead of using an external algorithm (Selective Search)
to propose regions, train a small network to do it. This made the entire
pipeline end-to-end trainable and much faster.
How it works: at every spatial position in the feature map, the RPN
places a set of "anchor" boxes of different sizes and aspect ratios,
then predicts (a) whether each anchor contains an object, and (b) how
to adjust the anchor to better fit the object.
"""
def __init__(
self,
in_channels: int,
anchor_sizes: List[int] = [128, 256, 512],
aspect_ratios: List[float] = [0.5, 1.0, 2.0]
):
super().__init__()
self.anchor_sizes = anchor_sizes
self.aspect_ratios = aspect_ratios
self.num_anchors = len(anchor_sizes) * len(aspect_ratios)
# Shared conv
self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
# Classification head (object vs background)
self.cls_head = nn.Conv2d(in_channels, self.num_anchors * 2, 1)
# Regression head (box deltas)
self.reg_head = nn.Conv2d(in_channels, self.num_anchors * 4, 1)
def generate_anchors(
self,
feature_map: torch.Tensor,
stride: int = 16
) -> torch.Tensor:
"""Generate anchor boxes for feature map."""
_, _, H, W = feature_map.shape
device = feature_map.device
# Grid of anchor centers
shifts_x = torch.arange(0, W, device=device) * stride + stride // 2
shifts_y = torch.arange(0, H, device=device) * stride + stride // 2
shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x, indexing='ij')
shifts = torch.stack([shift_x, shift_y, shift_x, shift_y], dim=-1)
shifts = shifts.reshape(-1, 4)
# Base anchors (centered at origin)
base_anchors = []
for size in self.anchor_sizes:
for ratio in self.aspect_ratios:
h = size / (ratio ** 0.5)
w = size * (ratio ** 0.5)
base_anchors.append([-w/2, -h/2, w/2, h/2])
base_anchors = torch.tensor(base_anchors, device=device)
# All anchors
anchors = shifts.unsqueeze(1) + base_anchors.unsqueeze(0)
anchors = anchors.reshape(-1, 4)
return anchors
def forward(
self,
features: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
"""
Args:
features: [B, C, H, W] backbone features
Returns:
objectness: [B, num_anchors*H*W, 2] objectness scores
bbox_deltas: [B, num_anchors*H*W, 4] box regression
anchors: [num_anchors*H*W, 4] anchor boxes
"""
batch_size = features.size(0)
x = F.relu(self.conv(features))
# Classification
objectness = self.cls_head(x) # [B, A*2, H, W]
objectness = objectness.permute(0, 2, 3, 1).reshape(batch_size, -1, 2)
# Regression
bbox_deltas = self.reg_head(x) # [B, A*4, H, W]
bbox_deltas = bbox_deltas.permute(0, 2, 3, 1).reshape(batch_size, -1, 4)
# Generate anchors
anchors = self.generate_anchors(features)
return objectness, bbox_deltas, anchors
class RoIPooling(nn.Module):
"""
Region of Interest Pooling.
Extract fixed-size features from variable-size proposals.
The problem: proposals come in all sizes (a car might be 200x100 pixels,
a stop sign might be 30x30). But the classification head needs a fixed-size
input. RoI Pooling solves this by dividing each proposal into a fixed grid
(e.g., 7x7) and max-pooling within each cell, producing a fixed 7x7 output
regardless of the proposal's original size.
Practical tip: RoI Align (used in Mask R-CNN) is strictly better than RoI
Pooling because it uses bilinear interpolation instead of snapping to grid
boundaries, avoiding quantization artifacts. Use RoI Align for any new work.
"""
def __init__(self, output_size: int = 7, spatial_scale: float = 1/16):
super().__init__()
self.output_size = output_size
self.spatial_scale = spatial_scale
def forward(
self,
features: torch.Tensor,
rois: torch.Tensor
) -> torch.Tensor:
"""
Args:
features: [B, C, H, W] feature maps
rois: [N, 5] (batch_idx, x1, y1, x2, y2)
Returns:
pooled: [N, C, output_size, output_size]
"""
# Scale RoIs to feature map coordinates
rois_scaled = rois.clone()
rois_scaled[:, 1:] *= self.spatial_scale
# Use torchvision's roi_pool
pooled = torchvision.ops.roi_pool(
features, rois_scaled,
output_size=self.output_size,
spatial_scale=1.0 # Already scaled
)
return pooled
class FasterRCNN(nn.Module):
"""
Faster R-CNN: Two-stage object detector.
Architecture:
1. Backbone (ResNet, etc.) → Feature maps
2. RPN → Region proposals
3. RoI Pooling → Fixed-size features
4. Detection head → Classes + refined boxes
"""
def __init__(
self,
num_classes: int,
backbone: nn.Module,
backbone_out_channels: int = 2048
):
super().__init__()
self.num_classes = num_classes
self.backbone = backbone
# RPN
self.rpn = RegionProposalNetwork(backbone_out_channels)
# RoI pooling
self.roi_pool = RoIPooling(output_size=7)
# Detection head
self.fc1 = nn.Linear(backbone_out_channels * 7 * 7, 1024)
self.fc2 = nn.Linear(1024, 1024)
self.cls_head = nn.Linear(1024, num_classes)
self.reg_head = nn.Linear(1024, num_classes * 4)
def forward(
self,
images: torch.Tensor,
targets: Optional[List[Dict]] = None
) -> Dict[str, torch.Tensor]:
"""
Args:
images: [B, 3, H, W] input images
targets: List of dicts with 'boxes' and 'labels' (training only)
"""
# Backbone
features = self.backbone(images)
# RPN
objectness, rpn_deltas, anchors = self.rpn(features)
# Get proposals (simplified - in practice use NMS)
proposals = self._get_proposals(anchors, rpn_deltas, objectness)
# RoI pooling
roi_features = self.roi_pool(features, proposals)
roi_features = roi_features.flatten(start_dim=1)
# Detection head
x = F.relu(self.fc1(roi_features))
x = F.relu(self.fc2(x))
cls_logits = self.cls_head(x)
box_deltas = self.reg_head(x)
return {
'cls_logits': cls_logits,
'box_deltas': box_deltas,
'rpn_objectness': objectness,
'rpn_deltas': rpn_deltas
}
def _get_proposals(
self,
anchors: torch.Tensor,
deltas: torch.Tensor,
scores: torch.Tensor,
top_k: int = 1000
) -> torch.Tensor:
"""Get top-k proposals after NMS."""
# Apply deltas to anchors
proposals = self._apply_deltas(anchors, deltas[0])
# Get objectness scores
scores = F.softmax(scores[0], dim=-1)[:, 1]
# Top-k
_, top_idx = scores.topk(min(top_k, len(scores)))
proposals = proposals[top_idx]
# Add batch index
batch_idx = torch.zeros(len(proposals), 1, device=proposals.device)
proposals = torch.cat([batch_idx, proposals], dim=1)
return proposals
def _apply_deltas(
self,
anchors: torch.Tensor,
deltas: torch.Tensor
) -> torch.Tensor:
"""Apply box deltas to anchors."""
# Convert anchors to (cx, cy, w, h)
widths = anchors[:, 2] - anchors[:, 0]
heights = anchors[:, 3] - anchors[:, 1]
cx = anchors[:, 0] + 0.5 * widths
cy = anchors[:, 1] + 0.5 * heights
# Apply deltas
dx, dy, dw, dh = deltas.unbind(-1)
pred_cx = dx * widths + cx
pred_cy = dy * heights + cy
pred_w = torch.exp(dw) * widths
pred_h = torch.exp(dh) * heights
# Convert back to (x1, y1, x2, y2)
pred_boxes = torch.stack([
pred_cx - 0.5 * pred_w,
pred_cy - 0.5 * pred_h,
pred_cx + 0.5 * pred_w,
pred_cy + 0.5 * pred_h
], dim=-1)
return pred_boxes
YOLO: Single-Stage Detection
YOLO took a radically different approach from the two-stage pipeline: forget proposals entirely, and predict everything in one shot. This is like glancing at a scene and instantly knowing what is where, rather than carefully scanning each region. The result: dramatically faster inference at a modest accuracy cost.YOLOv1 Core Concepts
class YOLOv1Head(nn.Module):
"""
YOLOv1: You Only Look Once (Redmon et al., 2016).
The key idea: treat detection as a single regression problem.
Divides image into S x S grid:
- Each cell predicts B bounding boxes
- Each box: (x, y, w, h, confidence)
- Plus C class probabilities per cell
Output: S x S x (B*5 + C)
Analogy: imagine overlaying a grid on the image. Each grid cell is
"responsible" for detecting objects whose center falls within it.
If an object's center is in cell (3, 4), that cell must predict its
bounding box and class.
Practical tip: YOLOv1 struggles with small objects (the grid is coarse)
and objects that appear in groups (each cell can only predict B boxes).
Later versions (v3+) fix this with multi-scale prediction.
"""
def __init__(
self,
num_classes: int = 20,
grid_size: int = 7,
num_boxes: int = 2
):
super().__init__()
self.S = grid_size
self.B = num_boxes
self.C = num_classes
self.output_size = self.S * self.S * (self.B * 5 + self.C)
# Final layers (after backbone)
self.fc = nn.Sequential(
nn.Flatten(),
nn.Linear(1024 * 7 * 7, 4096),
nn.LeakyReLU(0.1),
nn.Dropout(0.5),
nn.Linear(4096, self.output_size)
)
def forward(self, features: torch.Tensor) -> torch.Tensor:
"""
Returns:
predictions: [B, S, S, B*5+C]
"""
batch_size = features.size(0)
x = self.fc(features)
x = x.view(batch_size, self.S, self.S, self.B * 5 + self.C)
return x
def decode_predictions(
self,
predictions: torch.Tensor,
confidence_threshold: float = 0.5
) -> List[Dict]:
"""Decode YOLO predictions to boxes."""
batch_size = predictions.size(0)
results = []
for b in range(batch_size):
boxes = []
scores = []
labels = []
for i in range(self.S):
for j in range(self.S):
cell = predictions[b, i, j]
# Class probabilities
class_probs = cell[self.B * 5:]
for box_idx in range(self.B):
# Box predictions
box_offset = box_idx * 5
x = (cell[box_offset] + j) / self.S
y = (cell[box_offset + 1] + i) / self.S
w = cell[box_offset + 2] ** 2 # Squared to ensure positive
h = cell[box_offset + 3] ** 2
conf = cell[box_offset + 4]
# Class-specific confidence
class_scores = conf * class_probs
best_class = class_scores.argmax()
best_score = class_scores[best_class]
if best_score > confidence_threshold:
# Convert to (x1, y1, x2, y2)
x1 = x - w / 2
y1 = y - h / 2
x2 = x + w / 2
y2 = y + h / 2
boxes.append([x1, y1, x2, y2])
scores.append(best_score.item())
labels.append(best_class.item())
results.append({
'boxes': torch.tensor(boxes),
'scores': torch.tensor(scores),
'labels': torch.tensor(labels)
})
return results
YOLOv3 Architecture
class DarknetBlock(nn.Module):
"""Darknet residual block."""
def __init__(self, in_channels: int):
super().__init__()
hidden = in_channels // 2
self.conv1 = nn.Sequential(
nn.Conv2d(in_channels, hidden, 1, bias=False),
nn.BatchNorm2d(hidden),
nn.LeakyReLU(0.1)
)
self.conv2 = nn.Sequential(
nn.Conv2d(hidden, in_channels, 3, padding=1, bias=False),
nn.BatchNorm2d(in_channels),
nn.LeakyReLU(0.1)
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return x + self.conv2(self.conv1(x))
class YOLOv3Neck(nn.Module):
"""
YOLOv3 Feature Pyramid Neck.
Multi-scale predictions for better small object detection.
"""
def __init__(self, backbone_channels: List[int] = [256, 512, 1024]):
super().__init__()
# Lateral connections
self.lateral1 = nn.Conv2d(backbone_channels[2], 512, 1)
self.lateral2 = nn.Conv2d(backbone_channels[1], 256, 1)
self.lateral3 = nn.Conv2d(backbone_channels[0], 128, 1)
# Feature fusion
self.fuse1 = self._make_fuse_block(512 + 256, 256)
self.fuse2 = self._make_fuse_block(256 + 128, 128)
def _make_fuse_block(self, in_channels: int, out_channels: int):
return nn.Sequential(
nn.Conv2d(in_channels, out_channels, 1, bias=False),
nn.BatchNorm2d(out_channels),
nn.LeakyReLU(0.1),
nn.Conv2d(out_channels, out_channels * 2, 3, padding=1, bias=False),
nn.BatchNorm2d(out_channels * 2),
nn.LeakyReLU(0.1),
nn.Conv2d(out_channels * 2, out_channels, 1, bias=False),
nn.BatchNorm2d(out_channels),
nn.LeakyReLU(0.1)
)
def forward(
self,
features: Tuple[torch.Tensor, ...]
) -> Tuple[torch.Tensor, ...]:
"""
Args:
features: (C3, C4, C5) from backbone
Returns:
(P3, P4, P5) multi-scale features
"""
c3, c4, c5 = features
# Top-down pathway
p5 = self.lateral1(c5)
p5_upsampled = F.interpolate(p5, size=c4.shape[2:], mode='nearest')
p4 = self.fuse1(torch.cat([p5_upsampled, self.lateral2(c4)], dim=1))
p4_upsampled = F.interpolate(p4, size=c3.shape[2:], mode='nearest')
p3 = self.fuse2(torch.cat([p4_upsampled, self.lateral3(c3)], dim=1))
return p3, p4, p5
class YOLOv3Head(nn.Module):
"""YOLOv3 detection head for one scale."""
def __init__(
self,
in_channels: int,
num_classes: int,
num_anchors: int = 3
):
super().__init__()
self.num_classes = num_classes
self.num_anchors = num_anchors
# 5 = (x, y, w, h, objectness)
out_channels = num_anchors * (5 + num_classes)
self.conv = nn.Sequential(
nn.Conv2d(in_channels, in_channels * 2, 3, padding=1, bias=False),
nn.BatchNorm2d(in_channels * 2),
nn.LeakyReLU(0.1),
nn.Conv2d(in_channels * 2, out_channels, 1)
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
Returns:
predictions: [B, A, H, W, 5+C]
"""
batch_size = x.size(0)
x = self.conv(x) # [B, A*(5+C), H, W]
# Reshape
x = x.view(batch_size, self.num_anchors, 5 + self.num_classes,
x.size(2), x.size(3))
x = x.permute(0, 1, 3, 4, 2) # [B, A, H, W, 5+C]
return x
Anchor-Free Detection
Anchor boxes are powerful but come with baggage: you need to manually design anchor sizes and ratios, you generate thousands of anchors per image (most are negative), and you need complex matching logic to assign ground truth boxes to anchors. Anchor-free methods cut through all of this by predicting objects directly from feature map locations.A common pitfall when switching from anchor-based to anchor-free detectors: the evaluation metrics (mAP) can look similar on COCO, but the failure modes are different. Anchor-free methods tend to struggle more with overlapping objects of very different sizes at the same location (e.g., a person holding a phone), while anchor-based methods struggle more with objects whose aspect ratios do not match any predefined anchor. Examine your failure cases qualitatively, not just the headline mAP number.
FCOS (Fully Convolutional One-Stage)
class FCOSHead(nn.Module):
"""
FCOS: Fully Convolutional One-Stage Object Detection (Tian et al., 2019).
No anchors! Instead, every pixel in the feature map that falls inside
a ground-truth box is a positive training sample. Each positive pixel
directly regresses to the box by predicting:
- Per-pixel classification (what class is the object I am inside?)
- Distance to box edges (l, t, r, b) -- how far am I from each edge?
- Centerness (am I near the center of the object? suppress low-quality
predictions from pixels near the edges)
Practical tip: FCOS is simpler to implement than anchor-based methods
and eliminates all anchor-related hyperparameters. It is a great default
for new detection projects.
"""
def __init__(
self,
in_channels: int,
num_classes: int,
num_convs: int = 4
):
super().__init__()
self.num_classes = num_classes
# Shared convolutions
cls_convs = []
reg_convs = []
for _ in range(num_convs):
cls_convs.append(nn.Conv2d(in_channels, in_channels, 3, padding=1))
cls_convs.append(nn.GroupNorm(32, in_channels))
cls_convs.append(nn.ReLU())
reg_convs.append(nn.Conv2d(in_channels, in_channels, 3, padding=1))
reg_convs.append(nn.GroupNorm(32, in_channels))
reg_convs.append(nn.ReLU())
self.cls_tower = nn.Sequential(*cls_convs)
self.reg_tower = nn.Sequential(*reg_convs)
# Prediction heads
self.cls_logits = nn.Conv2d(in_channels, num_classes, 3, padding=1)
self.bbox_pred = nn.Conv2d(in_channels, 4, 3, padding=1) # l, t, r, b
self.centerness = nn.Conv2d(in_channels, 1, 3, padding=1)
# Learnable scale for each FPN level
self.scales = nn.Parameter(torch.ones(1))
def forward(self, features: torch.Tensor) -> Dict[str, torch.Tensor]:
"""
Returns:
cls_logits: [B, C, H, W] classification
bbox_pred: [B, 4, H, W] box regression
centerness: [B, 1, H, W] centerness
"""
cls_feat = self.cls_tower(features)
reg_feat = self.reg_tower(features)
cls_logits = self.cls_logits(cls_feat)
bbox_pred = self.scales * self.bbox_pred(reg_feat)
bbox_pred = F.relu(bbox_pred) # Distances must be positive
centerness = self.centerness(reg_feat)
return {
'cls_logits': cls_logits,
'bbox_pred': bbox_pred,
'centerness': centerness
}
class CenterNet(nn.Module):
"""
CenterNet: Objects as Points (Zhou et al., 2019).
Elegantly simple: every object is represented by its center point.
The network predicts three things:
- Heatmap of object centers (one channel per class, Gaussian peaks at centers)
- Size (w, h) at each center (how big is the object?)
- Local offset for sub-pixel accuracy (since the heatmap is lower-res
than the input, the center position is quantized; the offset corrects this)
Practical tip: CenterNet is popular for its simplicity -- no anchor design,
no NMS needed (just find local maxima in the heatmap). It is a great
choice when you want a clean, easy-to-debug detection pipeline.
"""
def __init__(
self,
backbone: nn.Module,
num_classes: int,
head_channels: int = 64
):
super().__init__()
self.backbone = backbone
self.num_classes = num_classes
# Upsampling to recover resolution
self.deconv_layers = self._make_deconv_layers(
in_channels=2048, # Assuming ResNet
out_channels=head_channels
)
# Prediction heads
self.heatmap = nn.Sequential(
nn.Conv2d(head_channels, head_channels, 3, padding=1),
nn.ReLU(),
nn.Conv2d(head_channels, num_classes, 1)
)
self.wh = nn.Sequential(
nn.Conv2d(head_channels, head_channels, 3, padding=1),
nn.ReLU(),
nn.Conv2d(head_channels, 2, 1) # width, height
)
self.offset = nn.Sequential(
nn.Conv2d(head_channels, head_channels, 3, padding=1),
nn.ReLU(),
nn.Conv2d(head_channels, 2, 1) # x_offset, y_offset
)
def _make_deconv_layers(self, in_channels: int, out_channels: int):
layers = []
channels = [256, 128, 64]
for c in channels:
layers.extend([
nn.ConvTranspose2d(in_channels, c, 4, stride=2, padding=1, bias=False),
nn.BatchNorm2d(c),
nn.ReLU()
])
in_channels = c
return nn.Sequential(*layers)
def forward(self, images: torch.Tensor) -> Dict[str, torch.Tensor]:
features = self.backbone(images)
features = self.deconv_layers(features)
heatmap = self.heatmap(features)
heatmap = torch.sigmoid(heatmap) # Normalize to [0, 1]
wh = self.wh(features)
offset = self.offset(features)
return {
'heatmap': heatmap,
'wh': wh,
'offset': offset
}
@staticmethod
def decode_detections(
heatmap: torch.Tensor,
wh: torch.Tensor,
offset: torch.Tensor,
k: int = 100
) -> Dict[str, torch.Tensor]:
"""Decode CenterNet predictions."""
batch, num_classes, h, w = heatmap.shape
# Find local peaks
heatmap_max = F.max_pool2d(heatmap, 3, stride=1, padding=1)
keep = (heatmap == heatmap_max).float()
heatmap = heatmap * keep
# Top-k peaks
heatmap_flat = heatmap.view(batch, -1)
topk_scores, topk_indices = heatmap_flat.topk(k)
# Get class and position
topk_classes = topk_indices // (h * w)
topk_indices = topk_indices % (h * w)
topk_y = topk_indices // w
topk_x = topk_indices % w
# Get size and offset
wh = wh.view(batch, 2, -1).permute(0, 2, 1)
offset = offset.view(batch, 2, -1).permute(0, 2, 1)
# Gather predictions at peak locations
topk_wh = wh.gather(1, topk_indices.unsqueeze(-1).expand(-1, -1, 2))
topk_offset = offset.gather(1, topk_indices.unsqueeze(-1).expand(-1, -1, 2))
# Compute boxes
topk_cx = topk_x.float() + topk_offset[..., 0]
topk_cy = topk_y.float() + topk_offset[..., 1]
boxes = torch.stack([
topk_cx - topk_wh[..., 0] / 2,
topk_cy - topk_wh[..., 1] / 2,
topk_cx + topk_wh[..., 0] / 2,
topk_cy + topk_wh[..., 1] / 2
], dim=-1)
return {
'boxes': boxes,
'scores': topk_scores,
'labels': topk_classes
}
DETR: Detection Transformer
DETR (2020) was a paradigm shift: it reframes detection as a set prediction problem. Instead of generating thousands of proposals and filtering with NMS (non-maximum suppression), DETR uses a fixed set of learned “object queries” and a transformer to directly output the final set of detections. No anchors, no NMS, no hand-crafted components. The training uses the Hungarian algorithm for optimal matching between predictions and ground truth, treating detection like a bipartite matching problem.class DETR(nn.Module):
"""
DETR: DEtection TRansformer (Carion et al., 2020).
End-to-end object detection with transformers.
No NMS, no anchors!
Architecture: CNN backbone -> Transformer encoder (global reasoning
over image features) -> Transformer decoder (object queries attend
to relevant image regions) -> prediction heads
Practical tip: DETR struggles with small objects and requires very
long training (300+ epochs vs 12 for Faster R-CNN). Deformable DETR
fixes the slow convergence by using deformable attention.
"""
def __init__(
self,
backbone: nn.Module,
num_classes: int,
hidden_dim: int = 256,
nheads: int = 8,
num_encoder_layers: int = 6,
num_decoder_layers: int = 6,
num_queries: int = 100
):
super().__init__()
self.backbone = backbone
self.num_queries = num_queries
# Project backbone features
self.input_proj = nn.Conv2d(2048, hidden_dim, 1)
# Transformer
self.transformer = nn.Transformer(
d_model=hidden_dim,
nhead=nheads,
num_encoder_layers=num_encoder_layers,
num_decoder_layers=num_decoder_layers,
dim_feedforward=2048,
dropout=0.1,
batch_first=True
)
# Object queries (learned)
self.query_embed = nn.Embedding(num_queries, hidden_dim)
# Position encoding
self.pos_encoder = PositionalEncoding2D(hidden_dim)
# Prediction heads
self.class_embed = nn.Linear(hidden_dim, num_classes + 1) # +1 for "no object"
self.bbox_embed = MLP(hidden_dim, hidden_dim, 4, num_layers=3)
def forward(self, images: torch.Tensor) -> Dict[str, torch.Tensor]:
"""
Args:
images: [B, 3, H, W]
Returns:
class_logits: [B, num_queries, num_classes+1]
pred_boxes: [B, num_queries, 4]
"""
batch_size = images.size(0)
# Backbone
features = self.backbone(images) # [B, C, H', W']
# Project and flatten
features = self.input_proj(features) # [B, hidden_dim, H', W']
# Position encoding
pos_embed = self.pos_encoder(features) # [B, hidden_dim, H', W']
# Flatten spatial dimensions
features = features.flatten(2).permute(0, 2, 1) # [B, H'*W', hidden_dim]
pos_embed = pos_embed.flatten(2).permute(0, 2, 1) # [B, H'*W', hidden_dim]
# Add position encoding
features = features + pos_embed
# Object queries
query_embed = self.query_embed.weight.unsqueeze(0).expand(batch_size, -1, -1)
tgt = torch.zeros_like(query_embed)
# Transformer
memory = self.transformer.encoder(features)
hs = self.transformer.decoder(tgt, memory, query_pos=query_embed)
# Predictions
class_logits = self.class_embed(hs) # [B, num_queries, num_classes+1]
pred_boxes = self.bbox_embed(hs).sigmoid() # [B, num_queries, 4]
return {
'class_logits': class_logits,
'pred_boxes': pred_boxes
}
class PositionalEncoding2D(nn.Module):
"""2D positional encoding for spatial features."""
def __init__(self, hidden_dim: int, temperature: float = 10000):
super().__init__()
self.hidden_dim = hidden_dim
self.temperature = temperature
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
Args:
x: [B, C, H, W]
Returns:
pos_encoding: [B, C, H, W]
"""
batch, _, h, w = x.shape
y_embed = torch.arange(h, device=x.device).float().unsqueeze(1).expand(h, w)
x_embed = torch.arange(w, device=x.device).float().unsqueeze(0).expand(h, w)
dim_t = torch.arange(self.hidden_dim // 2, device=x.device).float()
dim_t = self.temperature ** (2 * dim_t / self.hidden_dim)
pos_x = x_embed.unsqueeze(-1) / dim_t
pos_y = y_embed.unsqueeze(-1) / dim_t
pos_x = torch.stack([pos_x.sin(), pos_x.cos()], dim=-1).flatten(-2)
pos_y = torch.stack([pos_y.sin(), pos_y.cos()], dim=-1).flatten(-2)
pos = torch.cat([pos_y, pos_x], dim=-1).permute(2, 0, 1)
return pos.unsqueeze(0).expand(batch, -1, -1, -1)
class MLP(nn.Module):
"""Simple MLP for box prediction."""
def __init__(self, input_dim, hidden_dim, output_dim, num_layers):
super().__init__()
layers = []
for i in range(num_layers):
in_dim = input_dim if i == 0 else hidden_dim
out_dim = output_dim if i == num_layers - 1 else hidden_dim
layers.append(nn.Linear(in_dim, out_dim))
if i < num_layers - 1:
layers.append(nn.ReLU())
self.layers = nn.Sequential(*layers)
def forward(self, x):
return self.layers(x)
class HungarianMatcher:
"""
Hungarian matching for DETR training.
Matches predictions to ground truth using optimal assignment.
The challenge: DETR outputs a fixed set of 100 predictions, but an
image might have 3 objects. Which 3 predictions should be matched to
which 3 objects? Hungarian matching finds the optimal one-to-one
assignment that minimizes a combined cost of classification error,
box distance, and GIoU. The remaining 97 predictions are trained to
predict "no object."
This is what makes DETR truly end-to-end: there is no hand-crafted
NMS or anchor matching -- the matching itself is learned (indirectly,
through the training objective).
"""
def __init__(
self,
cost_class: float = 1.0,
cost_bbox: float = 5.0,
cost_giou: float = 2.0
):
self.cost_class = cost_class
self.cost_bbox = cost_bbox
self.cost_giou = cost_giou
@torch.no_grad()
def __call__(
self,
outputs: Dict[str, torch.Tensor],
targets: List[Dict[str, torch.Tensor]]
) -> List[Tuple[torch.Tensor, torch.Tensor]]:
"""
Returns list of (pred_indices, target_indices) for each batch element.
"""
from scipy.optimize import linear_sum_assignment
batch_size, num_queries = outputs['class_logits'].shape[:2]
# Flatten predictions
pred_probs = outputs['class_logits'].softmax(-1)
pred_boxes = outputs['pred_boxes']
indices = []
for b in range(batch_size):
target_labels = targets[b]['labels']
target_boxes = targets[b]['boxes']
num_targets = len(target_labels)
if num_targets == 0:
indices.append((torch.tensor([]), torch.tensor([])))
continue
# Classification cost
cost_class = -pred_probs[b, :, target_labels]
# L1 box cost
cost_bbox = torch.cdist(pred_boxes[b], target_boxes, p=1)
# GIoU cost
cost_giou = -self._giou(pred_boxes[b], target_boxes)
# Total cost
C = (
self.cost_class * cost_class +
self.cost_bbox * cost_bbox +
self.cost_giou * cost_giou
)
# Hungarian algorithm
row_ind, col_ind = linear_sum_assignment(C.cpu().numpy())
indices.append((
torch.tensor(row_ind, dtype=torch.long),
torch.tensor(col_ind, dtype=torch.long)
))
return indices
def _giou(self, boxes1: torch.Tensor, boxes2: torch.Tensor) -> torch.Tensor:
"""Compute GIoU between two sets of boxes."""
return torchvision.ops.generalized_box_iou(boxes1, boxes2)
Comparison
def detection_comparison():
"""Compare detection approaches."""
comparison = """
╔════════════════════════════════════════════════════════════════════╗
║ OBJECT DETECTION COMPARISON ║
╠════════════════════════════════════════════════════════════════════╣
║ ║
║ Method mAP FPS Pros Cons ║
║ ───────────────────────────────────────────────────────────── ║
║ Faster R-CNN 42.0 ~5 High accuracy Slow ║
║ YOLOv3 33.0 ~45 Fast Lower accuracy ║
║ YOLOv5 50.7 ~140 Fast + accurate Complex ║
║ FCOS 44.7 ~23 Simple, no anchor Anchor-based ║
║ CenterNet 45.1 ~28 Simple Fixed scales ║
║ DETR 42.0 ~28 End-to-end Slow training ║
║ ║
╠════════════════════════════════════════════════════════════════════╣
║ WHEN TO USE WHAT ║
╠════════════════════════════════════════════════════════════════════╣
║ ║
║ • Real-time (>30 FPS): YOLOv5, YOLOv8 ║
║ • High accuracy: Faster R-CNN, Cascade R-CNN ║
║ • Simple training: FCOS, CenterNet ║
║ • Research/flexibility: DETR, Deformable DETR ║
║ • Small objects: FPN-based methods, high resolution ║
║ • Dense scenes: Anchor-free methods ║
║ ║
╚════════════════════════════════════════════════════════════════════╝
"""
print(comparison)
detection_comparison()
Exercises
Exercise 1: Implement IoU Loss
Exercise 1: Implement IoU Loss
Implement GIoU, DIoU, and CIoU losses:
def giou_loss(pred_boxes, target_boxes):
# Generalized IoU
pass
def diou_loss(pred_boxes, target_boxes):
# Distance IoU
pass
def ciou_loss(pred_boxes, target_boxes):
# Complete IoU
pass
Exercise 2: Multi-Scale Training
Exercise 2: Multi-Scale Training
Implement multi-scale training for YOLO:
- Randomly resize input during training
- Adjust anchor scales accordingly
- Handle variable batch sizes
Exercise 3: Soft-NMS
Exercise 3: Soft-NMS
Implement Soft-NMS for better overlapping object handling:
def soft_nms(boxes, scores, sigma=0.5, threshold=0.001):
# Decay overlapping box scores instead of removing
pass
What’s Next?
Semantic Segmentation
Per-pixel classification with U-Net, DeepLab
Hyperparameter Tuning
Systematic optimization of training configs