Vision Transformers
From CNNs to Transformers
CNNs dominated vision for a decade with their inductive biases:

- Locality (convolutions)
- Translation equivariance
- Hierarchical features
ViT Architecture
Core Idea
Split the image into fixed-size patches → treat each patch as a token → apply a standard transformer encoder.

Patch Embedding
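A minimal PyTorch sketch of patch embedding (module and parameter names are illustrative, not a reference implementation). The trick is that a convolution whose kernel size and stride both equal the patch size is exactly "flatten each patch and apply a shared linear projection."

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches and project each patch to an embedding vector."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # kernel = stride = patch_size: one output position per patch,
        # equivalent to a shared linear projection of each flattened patch.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: (B, C, H, W)
        x = self.proj(x)                      # (B, embed_dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)      # (B, num_patches, embed_dim)
        return x
```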
Full ViT Implementation
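A compact sketch of the full model, reusing the PatchEmbedding module above and PyTorch's built-in transformer encoder. The dimensions follow ViT-Base, but this is a simplified illustration rather than the paper's exact implementation.

```python
class ViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3,
                 num_classes=1000, embed_dim=768, depth=12, num_heads=12,
                 mlp_ratio=4.0, dropout=0.1):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, in_channels, embed_dim)
        num_patches = self.patch_embed.num_patches

        # Learnable [CLS] token and positional embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        self.dropout = nn.Dropout(dropout)

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=int(embed_dim * mlp_ratio),
            dropout=dropout, activation="gelu",
            batch_first=True, norm_first=True)   # pre-norm, as in ViT
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)

        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        nn.init.trunc_normal_(self.cls_token, std=0.02)

    def forward(self, x):
        x = self.patch_embed(x)                          # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)  # (B, 1, D)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.dropout(x)
        x = self.encoder(x)
        x = self.norm(x[:, 0])                           # take the [CLS] token
        return self.head(x)
```

A quick shape check: `ViT(num_classes=10)(torch.randn(2, 3, 224, 224))` should return logits of shape `(2, 10)`.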
ViT Variants
DeiT (Data-efficient Image Transformer)
Training improvements that let ViT work without massive pretraining data: strong augmentation (RandAugment, Mixup, CutMix), regularization such as stochastic depth, and knowledge distillation from a CNN teacher through an extra distillation token.
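As a concrete illustration, here is a minimal sketch of DeiT-style hard distillation. It assumes the student produces separate logits from its class token and its distillation token; the function name and the equal weighting of the two terms are illustrative.

```python
import torch
import torch.nn.functional as F

def deit_hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Class token matches the ground-truth label; distillation token matches
    the hard prediction of a (frozen) CNN teacher."""
    teacher_labels = teacher_logits.argmax(dim=-1)
    loss_cls = F.cross_entropy(cls_logits, labels)
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)
    return 0.5 * loss_cls + 0.5 * loss_dist
```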
Swin Transformer

Hierarchical feature maps with self-attention computed inside local windows; the windows are shifted between consecutive layers so information flows across window boundaries while the attention cost stays linear in the number of patches.

Using Pretrained ViTs
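A short sketch of fine-tuning a pretrained ViT, assuming the timm library is installed; the model name and the 10-class head are just an example.

```python
import timm
import torch

# Load a ViT-B/16 pretrained on ImageNet and attach a fresh 10-class head.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=10)

# Optionally freeze the backbone and fine-tune only the classification head.
for name, param in model.named_parameters():
    if "head" not in name:
        param.requires_grad = False

x = torch.randn(1, 3, 224, 224)
logits = model(x)   # shape (1, 10)
```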
ViT vs CNN Comparison
| Aspect | CNN | ViT |
|---|---|---|
| Inductive bias | Strong (locality) | Weak (learns from data) |
| Data efficiency | Better with small data | Needs large data or distillation |
| Compute | Convolutions scale linearly with input size | Self-attention is O(N²) in the number of patches |
| Scalability | Gains tend to saturate at scale | Keeps improving with more data and parameters |
| Interpretability | Filter visualization | Attention maps |
Visualizing Attention
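One common way to see what a ViT attends to is attention rollout, which multiplies the per-layer attention matrices (averaged over heads, with the residual connection mixed in) to estimate how much each patch contributes to the [CLS] token. The sketch below assumes you have already collected the per-layer attention maps, e.g. with forward hooks; the random tensors at the end only demonstrate the expected shapes.

```python
import torch

def attention_rollout(attentions, residual_alpha=0.5):
    """Combine per-layer attention maps into one [CLS]-to-patch heatmap.

    attentions: list of tensors of shape (num_heads, N, N), one per layer,
    where N = 1 + num_patches and token 0 is [CLS].
    """
    n = attentions[0].shape[-1]
    rollout = torch.eye(n)
    for attn in attentions:
        attn = attn.mean(dim=0)                              # average over heads
        attn = residual_alpha * attn + (1 - residual_alpha) * torch.eye(n)
        attn = attn / attn.sum(dim=-1, keepdim=True)         # renormalize rows
        rollout = attn @ rollout
    cls_to_patches = rollout[0, 1:]                          # drop the [CLS] column
    grid = int(cls_to_patches.numel() ** 0.5)
    return cls_to_patches.reshape(grid, grid)

# Shape demo: 12 layers, 12 heads, a 14x14 patch grid (196 patches + [CLS]).
fake_attn = [torch.softmax(torch.randn(12, 197, 197), dim=-1) for _ in range(12)]
heatmap = attention_rollout(fake_attn)   # (14, 14), upsample and overlay on the image
```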
Exercises
Exercise 1: Build ViT from Scratch
Implement a complete ViT and train it on CIFAR-10 with proper augmentation.
Exercise 2: Attention Visualization
Visualize attention maps for different images. What does the model attend to?
Exercise 3: ViT vs ResNet
Compare ViT and ResNet on the same dataset. Analyze accuracy vs compute tradeoffs.