Vision Transformers
From CNNs to Transformers
CNNs dominated vision for a decade with their inductive biases:
- Locality (convolutions): each filter only looks at a small spatial neighborhood, just as a doctor examines a tissue sample under a microscope rather than trying to see the whole body at once
- Translation equivariance: a feature detector that recognizes a cat ear in the top-left corner will recognize it in the bottom-right corner — no need to re-learn the same pattern at every position
- Hierarchical features: early layers detect edges, middle layers combine edges into textures and parts, and deep layers recognize objects
ViT Architecture
Core Idea
Split the image into patches, treat each patch as a token, and apply a standard Transformer.

The analogy: Imagine reading a book by cutting each page into a grid of sticky notes, then feeding all the sticky notes into a reading comprehension model. Each sticky note (patch) is a “word” in the model’s vocabulary. The self-attention mechanism lets every patch attend to every other patch, so a patch showing a wheel can directly attend to a patch showing a road, even if they’re on opposite sides of the image. A CNN would need many stacked layers to propagate information across that distance.

The math: For a 224x224 image with 16x16 patches, you get $(224/16)^2 = 14 \times 14 = 196$ tokens, each of dimension $16 \times 16 \times 3 = 768$ (flattened patch pixels). The self-attention cost is $O(N^2 \cdot d)$ where $N = 196$, which is manageable. But try 4x4 patches on 224x224 and you get $56 \times 56 = 3{,}136$ tokens, and attention becomes the bottleneck.
Patch Embedding
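As a rough illustration of the patch embedding step, the sketch below splits an image into non-overlapping patches, flattens each one, and applies a linear projection. It uses NumPy only to keep the example small. The embedding width of 192, the random weights, and the function names `patchify` and `embed_patches` are assumptions made for illustration, not values taken from this page.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_patches(patches, weight, bias):
    """Linearly project flattened patches into the embedding space."""
    return patches @ weight + bias

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))

tokens = patchify(image, patch_size=16)   # shape (196, 768)

embed_dim = 192                           # hypothetical embedding width
weight = rng.standard_normal((tokens.shape[1], embed_dim)) * 0.02
bias = np.zeros(embed_dim)
embedded = embed_patches(tokens, weight, bias)  # shape (196, 192)

print(tokens.shape, embedded.shape)
```

Calling `patchify` with `patch_size=4` on the same image yields 3,136 tokens of dimension 48, which matches the token count quoted above for 4x4 patches.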
Full ViT Implementation
ViT Variants
DeiT (Data-efficient Image Transformer)
ViT’s original paper required 300M images (JFT-300M) for strong results. DeiT, from Facebook AI, showed you can train ViTs competitively on ImageNet alone (1.2M images) using better training recipes and knowledge distillation. The key innovation is a distillation token that learns to mimic a CNN teacher’s predictions.
Swin Transformer
The problem with standard ViT: self-attention over all 196 patches costs on the order of $196^2 \approx 38{,}000$ pairwise operations. Scale to 1024x1024 images with 16x16 patches and you get 4,096 tokens, so attention costs about $4{,}096^2 \approx 16.8$ million pairwise operations per layer. That’s impractical for dense prediction tasks (segmentation, detection).

Swin’s solution: Restrict attention to local windows (e.g., 7x7 patches), reducing complexity from $O(N^2)$ to $O(N \cdot M^2)$, where $N$ is the number of patches and $M$ is the window size. To allow cross-window communication, alternate layers shift the window grid by half a window, so patterns that span a window boundary in one layer fall inside a window in the next. This is hierarchical (like a CNN) but uses attention (like a ViT).
Using Pretrained ViTs
ViT vs CNN Comparison
| Aspect | CNN | ViT |
|---|---|---|
| Inductive bias | Strong (locality, translation equivariance) | Weak (learns spatial relationships from data) |
| Data efficiency | Better with small data (10K-100K images) | Needs large data (1M+) or distillation from a CNN teacher |
| Compute | Efficient; FLOPs scale linearly with image size | O(N^2) attention; quadratic in token count |
| Scalability | Accuracy saturates beyond ~300M params | Continues improving with more params and data |
| Interpretability | Filter visualization, GradCAM | Attention maps show which patches attend to which |
| Resolution flexibility | Works at any resolution natively | Requires interpolating positional embeddings for new resolutions |
| Dense prediction | Natural hierarchical features for detection/segmentation | Requires modifications (Swin, ViTDet) for multi-scale features |
Visualizing Attention
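One simple way to look at attention, sketched here under stated assumptions, is to take the row of the attention matrix that belongs to the [CLS] token, average it over heads, and reshape it onto the patch grid. The attention weights below are random stand-ins rather than outputs of a trained model, and `cls_attention_map` is a hypothetical helper name. The sketch assumes a 224x224 input with 16x16 patches, so 196 patch tokens plus one [CLS] token at index 0.

```python
import numpy as np

def cls_attention_map(attn, grid_size):
    """Turn [CLS]-to-patch attention into a (grid_size, grid_size) heatmap.

    attn: (heads, 1 + N, 1 + N) attention weights from one layer,
          with the [CLS] token at index 0.
    """
    cls_to_patches = attn[:, 0, 1:]        # (heads, N), drop CLS-to-CLS
    heat = cls_to_patches.mean(axis=0)     # average over heads
    heat = heat / heat.sum()               # renormalize over patches
    return heat.reshape(grid_size, grid_size)

# Random stand-in for softmax attention weights (3 heads, 197 tokens).
rng = np.random.default_rng(0)
logits = rng.standard_normal((3, 197, 197))
attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

heatmap = cls_attention_map(attn, grid_size=14)    # (14, 14)

# Nearest-neighbor upsampling to image resolution for overlaying.
overlay = np.kron(heatmap, np.ones((16, 16)))      # (224, 224)
print(heatmap.shape, overlay.shape)
```

With a real model, `attn` would come from the attention weights of a chosen layer, and `overlay` could be blended with the input image using any plotting library.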
Exercises
Exercise 1: Build ViT from Scratch
Exercise 2: Attention Visualization
Exercise 3: ViT vs ResNet
Training Tips
Interview Deep-Dive
Explain how Vision Transformers process images. Why do they need positional embeddings, and what happens without them?
- ViTs split an image into fixed-size, non-overlapping patches (typically 16x16 pixels), flatten each patch into a vector, and linearly project it into an embedding space. These patch embeddings become the “tokens” for a standard Transformer encoder — identical to word tokens in NLP.
- A learnable [CLS] token is prepended to the sequence. After passing through Transformer layers (self-attention + MLP), the [CLS] token’s representation is used for classification. The intuition: through self-attention, the [CLS] token aggregates information from all patches, producing a global image summary.
- Why positional embeddings are necessary: self-attention is permutation-equivariant — swapping two tokens in the input swaps the corresponding outputs, but doesn’t change the attention computation. Without positional embeddings, the model cannot distinguish between an original image and a randomly shuffled version (same patches, different arrangement). Positional embeddings inject spatial order by adding a unique learned vector to each token based on its position.
- Without positional embeddings: the model can still learn to classify some images (since patch content alone carries information), but accuracy drops by 3-5% on ImageNet. Interestingly, the learned positional embeddings exhibit a 2D spatial structure when visualized — nearby patches have similar positional embeddings, and the model effectively recovers a notion of spatial locality from data.
- A senior engineer would note: the choice between learned absolute positional embeddings (ViT), sinusoidal (original Transformer), relative position bias (Swin), and rotary embeddings (RoPE, used in some recent vision models) significantly affects the model’s ability to generalize to new resolutions. Relative and rotary embeddings generalize better because they encode distances rather than absolute positions.
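The permutation-equivariance argument above can be checked numerically. The following is a minimal single-head sketch with random weights, not a trained ViT: the dimensions, the random "positional embeddings", and the helper names are assumptions for illustration. It shows that without positional embeddings, shuffling the patches only shuffles the outputs (so a pooled summary is unchanged), while adding a per-position vector makes the output depend on patch order.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over rows of x."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
n, d = 196, 32                      # 196 patch tokens, hypothetical width 32
x = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
perm = rng.permutation(n)           # a random shuffle of the patches

# Without positional embeddings: shuffled input -> shuffled output.
out = self_attention(x, wq, wk, wv)
out_shuffled = self_attention(x[perm], wq, wk, wv)
equivariant = np.allclose(out_shuffled, out[perm])
pooled_same = np.allclose(out.mean(axis=0), out_shuffled.mean(axis=0))

# With positional embeddings (random stand-ins for learned vectors),
# the same shuffle now changes the result.
pos = rng.standard_normal((n, d)) * 0.5
out_pos = self_attention(x + pos, wq, wk, wv)
out_pos_shuffled = self_attention(x[perm] + pos, wq, wk, wv)
order_sensitive = not np.allclose(out_pos_shuffled, out_pos[perm])

print(equivariant, pooled_same, order_sensitive)
```

The `pooled_same` check is the practical consequence: any classifier head that reads a pooled or [CLS]-style summary sees the same input for the original and the shuffled image unless position information is injected.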
Compare ViT, DeiT, and Swin Transformer. What problem does each solve, and when would you choose one over the others?
- ViT (Dosovitskiy et al., 2020): the original vision transformer. Pure self-attention over all patches. Demonstrated that transformers can match CNNs on vision, but required pretraining on JFT-300M (300 million images). Without massive pretraining, ViT underperforms ResNets on ImageNet. Choose when: you have access to large-scale pretraining data or pretrained checkpoints, and your task is image classification.
- DeiT (Touvron et al., 2021): same architecture as ViT, but with a better training recipe: strong data augmentation (RandAugment, Mixup, CutMix), knowledge distillation from a RegNet CNN teacher, and stochastic depth. Achieves 83.1% top-1 on ImageNet with only ImageNet-1K training data (1.2M images). The key insight: ViT’s poor performance on smaller datasets was a training problem, not an architecture problem. Choose when: you want ViT-level performance without JFT-scale pretraining data. Use the `deit_*` variants from `timm` as your go-to starting point.
- Swin Transformer (Liu et al., 2021): introduces hierarchical feature maps (like a CNN) and local window attention with shifted windows. Attention is $O(N)$ instead of $O(N^2)$ in the number of patches (for a fixed window size), and the hierarchical structure produces multi-scale feature maps naturally. Choose when: you need a vision backbone for dense prediction tasks (object detection, semantic segmentation, instance segmentation) where multi-scale features are essential. Swin has largely replaced CNNs as the backbone in modern detection/segmentation frameworks (Mask R-CNN, Cascade R-CNN).
- Decision framework in practice: for classification tasks with pretrained checkpoints, DeiT or plain ViT fine-tuning is hard to beat. For detection/segmentation, Swin or ViTDet (ViT adapted for detection with simple feature pyramids). For edge deployment where latency matters, EfficientNet or MobileNet may still win because ViTs are harder to quantize and optimize for mobile hardware.
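To make Swin's window mechanism concrete, here is a small sketch of window partitioning and the half-window shift on a grid of token ids. It assumes a 56x56 token grid and 7x7 windows, chosen for illustration, and it omits the attention mask that a full implementation needs so that tokens wrapped around by the cyclic shift do not attend to each other. The function names are hypothetical.

```python
import numpy as np

def window_partition(x, window):
    """(H, W, C) feature map -> (num_windows, window * window, C)."""
    h, w, c = x.shape
    assert h % window == 0 and w % window == 0
    x = x.reshape(h // window, window, w // window, window, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, window * window, c)

def cyclic_shift(x, window):
    """Shift the grid up and left by half a window (wrapping around)."""
    s = window // 2
    return np.roll(x, shift=(-s, -s), axis=(0, 1))

def attention_pairs(num_tokens, window=None):
    """Count query-key pairs: global attention vs. windowed attention."""
    if window is None:
        return num_tokens ** 2
    return num_tokens * window ** 2

def window_of(windows, token_id):
    """Index of the window that contains a given token id."""
    return int(np.argwhere(windows[..., 0] == token_id)[0, 0])

grid, window = 56, 7
ids = np.arange(grid * grid).reshape(grid, grid, 1)

regular = window_partition(ids, window)                        # (64, 49, 1)
shifted = window_partition(cyclic_shift(ids, window), window)  # (64, 49, 1)

# Two vertically adjacent tokens on either side of a window boundary.
a, b = 6 * grid, 7 * grid   # (row 6, col 0) and (row 7, col 0)
print(window_of(regular, a), window_of(regular, b))   # different windows
print(window_of(shifted, a), window_of(shifted, b))   # same window

print(attention_pairs(grid * grid), attention_pairs(grid * grid, window))
```

For this grid the pair count falls from 3,136² = 9,834,496 to 3,136 × 49 = 153,664, a factor of 64, which is the number of windows.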
ViT's self-attention is O(N^2) in the number of patches. What are three approaches to make this more efficient, and what does each sacrifice?
- Approach 1: Window attention (Swin). Restrict attention to local windows of size $M \times M$ patches. Complexity drops from $O(N^2)$ to $O(N \cdot M^2)$, which is linear in image size. Shifted windows across alternating layers allow cross-window information flow. Sacrifice: global attention is only achieved after multiple layers of shifted windows; in early layers, distant patches cannot directly communicate. This is fine for most vision tasks but can hurt tasks requiring long-range pixel-level dependencies.
- Approach 2: Linear attention / kernel approximation. Replace the softmax attention kernel $\mathrm{softmax}(QK^\top)V$ with a factored form $\phi(Q)\,(\phi(K)^\top V)$, where $\phi$ is a feature map (e.g., random Fourier features, ELU+1). This allows computing attention in $O(N d^2)$ instead of $O(N^2 d)$, which is cheaper when $N > d$. Used in Performer, Linear Transformer. Sacrifice: the approximation of softmax attention is imperfect. Sharp, peaked attention patterns (common in vision) are poorly approximated, leading to 1-3% accuracy drops on ImageNet.
- Approach 3: Token reduction / token merging. Progressively reduce the number of tokens through the network. Methods include: average pooling of neighboring tokens (PoolFormer), learned token merging based on similarity (ToMe — Token Merging), or keeping only the top-K most informative tokens (DynamicViT). Sacrifice: spatial resolution is lost, which is acceptable for classification but problematic for dense prediction tasks that need per-pixel outputs.
- A senior engineer would note: in practice, the cost of ViT at 224x224 resolution (196 tokens) is already fast enough for most use cases. A DeiT-Small-sized ViT is comparable to a ResNet-50 in FLOPs. The efficiency question becomes critical at high resolutions (1024x1024+) or in video (where tokens multiply by frame count). FlashAttention (Dao et al., 2022) is often the most practical solution: it doesn’t change the asymptotic compute complexity but reduces memory usage from $O(N^2)$ to $O(N)$ and achieves 2-4x wall-clock speedup through IO-aware tiling. It’s a systems optimization, not an algorithmic one, and it preserves exact attention semantics.
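The reordering behind linear attention (Approach 2) can be sketched in a few lines. The example assumes the ELU+1 feature map and includes the row normalizer that linear attention variants typically use; sizes and function names are illustrative. It shows that the $O(N d^2)$ ordering gives the same result as explicitly building the $N \times N$ kernel matrix. It does not show equivalence to softmax attention, which is exactly the approximation gap described above.

```python
import numpy as np

def elu_plus_one(x):
    """Feature map phi(x) = ELU(x) + 1, which is always positive."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def kernel_attention_quadratic(q, k, v):
    """Reference version: materializes the (N, N) kernel matrix."""
    fq, fk = elu_plus_one(q), elu_plus_one(k)
    scores = fq @ fk.T                                  # (N, N)
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

def kernel_attention_linear(q, k, v):
    """Same result via associativity: phi(Q) @ (phi(K)^T @ V)."""
    fq, fk = elu_plus_one(q), elu_plus_one(k)
    kv = fk.T @ v                                       # (d, d)
    z = fk.sum(axis=0)                                  # (d,)
    return (fq @ kv) / (fq @ z)[:, None]

rng = np.random.default_rng(0)
n, d = 1024, 64            # token count larger than head dimension
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

out_ref = kernel_attention_quadratic(q, k, v)
out_fast = kernel_attention_linear(q, k, v)
print(np.allclose(out_ref, out_fast))
```

The linear version never allocates an array with $N^2$ entries; its largest intermediate is the $d \times d$ matrix `kv`, which is why the saving only appears when $N > d$.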
You need to deploy a ViT-based model for real-time image classification on a mobile device. Walk through your optimization strategy.
- Step 1: Model selection. Start with the smallest ViT variant that meets your accuracy requirement. DeiT-Tiny (5.7M params) or DeiT-Small (22M params) are good baselines. If latency is extremely tight, consider MobileViT, EfficientFormer, or FastViT — these are hybrid architectures specifically designed for mobile deployment.
- Step 2: Knowledge distillation. Train a smaller student ViT using a larger pretrained ViT or CNN ensemble as the teacher. This typically recovers 60-80% of the accuracy gap between the small and large models. The DeiT distillation approach (using a distillation token) is particularly effective.
- Step 3: Quantization. ViTs are harder to quantize than CNNs because attention logits can have high dynamic range. Use post-training quantization (PTQ) with careful calibration — quantize to INT8 for weights and activations. The LayerNorm and Softmax operations may need to stay in FP32 (mixed-precision quantization). Expect 1.5-2x speedup with less than 1% accuracy drop for INT8.
- Step 4: Token reduction at inference. Apply Token Merging (ToMe) to reduce the number of tokens by 30-50% after the first few layers. Tokens with similar features are merged (averaged), dramatically reducing the attention computation for deeper layers. This comes close to a free lunch: roughly 2x speedup with less than 0.5% accuracy loss on ImageNet.
- Step 5: Export and runtime. Export via ONNX, then optimize with TensorRT (NVIDIA), CoreML (Apple), or TFLite (Android). These runtimes fuse operations (LayerNorm + Linear, attention kernel fusion) and use hardware-specific optimizations. On Apple devices with the Neural Engine, CoreML can run DeiT-Small at roughly 5ms per image on an iPhone 15.
- Key metric: measure end-to-end latency on the target device, not just FLOPs. ViTs have different bottlenecks than CNNs — attention is memory-bandwidth bound (not compute bound), so reducing FLOPs doesn’t always translate linearly to latency gains.
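The dynamic-range concern in Step 3 can be illustrated with a toy example. This is a minimal sketch of symmetric per-tensor INT8 quantization in NumPy, not the calibration procedure of any particular toolkit; the synthetic data and the single injected outlier of 100.0 are assumptions chosen to make the effect visible.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: returns (int8 values, scale)."""
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to floating point."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)

# Well-behaved tensor (e.g. weights): small, evenly spread values.
weights = rng.standard_normal(10_000).astype(np.float32)
q_w, scale_w = quantize_int8(weights)
err_weights = float(np.abs(dequantize(q_w, scale_w) - weights).mean())

# Same values plus one large outlier, as can happen with attention logits.
logits = weights.copy()
logits[0] = 100.0
q_l, scale_l = quantize_int8(logits)
err_logits = float(np.abs(dequantize(q_l, scale_l) - logits).mean())

print(f"scale without outlier: {scale_w:.4f}, mean abs error: {err_weights:.4f}")
print(f"scale with outlier:    {scale_l:.4f}, mean abs error: {err_logits:.4f}")
```

A single outlier stretches the quantization scale, so most ordinary values collapse onto a handful of integer levels. This is one reason the softmax and LayerNorm paths are often kept in higher precision.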