# Multimodal Models

## Connecting Vision and Language
Multimodal models understand multiple types of data, such as images, text, and audio, in a shared representation space. Key applications:

- Image-text search
- Visual question answering
- Image captioning
- Text-to-image generation
## CLIP: Contrastive Language-Image Pretraining

### Core Idea
Learn a shared embedding space in which matching image-text pairs are pulled close together and mismatched pairs are pushed apart.

### CLIP Implementation
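A minimal sketch of CLIP's symmetric contrastive objective, with toy linear layers and random features standing in for the real image and text encoders (all dimensions here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveImageText(nn.Module):
    """Toy CLIP-style model: two encoders projecting into one shared space."""

    def __init__(self, image_dim=512, text_dim=256, embed_dim=128):
        super().__init__()
        self.image_encoder = nn.Linear(image_dim, embed_dim)  # stand-in for a ViT/ResNet
        self.text_encoder = nn.Linear(text_dim, embed_dim)    # stand-in for a text Transformer
        # Learnable temperature, initialised to log(1/0.07) as in the CLIP paper
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1 / 0.07)))

    def forward(self, image_feats, text_feats):
        img = F.normalize(self.image_encoder(image_feats), dim=-1)
        txt = F.normalize(self.text_encoder(text_feats), dim=-1)
        logits = self.logit_scale.exp() * img @ txt.t()  # (N, N) similarity matrix
        targets = torch.arange(img.size(0))              # matching pairs lie on the diagonal
        # Symmetric cross-entropy: image-to-text and text-to-image
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

model = ContrastiveImageText()
loss = model(torch.randn(8, 512), torch.randn(8, 256))  # batch of 8 dummy image/text features
loss.backward()
```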
### Using Pretrained CLIP
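One way this might look with the Hugging Face `transformers` implementation of CLIP; the checkpoint name and image path are illustrative:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg").convert("RGB")  # placeholder path
texts = ["a photo of a cat", "a photo of a dog", "a city skyline at night"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarities scaled by the learned temperature
probs = outputs.logits_per_image.softmax(dim=-1)
for text, p in zip(texts, probs[0]):
    print(f"{p.item():.3f}  {text}")
```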
### Zero-Shot Classification
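Zero-shot classification turns each class name into a text prompt and picks the prompt whose embedding is most similar to the image embedding. A possible implementation with the same `transformers` CLIP API (the prompt template, class names, and image path are placeholder choices):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def zero_shot_classify(image, class_names, template="a photo of a {}"):
    prompts = [template.format(c) for c in class_names]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_embeds = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
        image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])
    # Cosine similarity between L2-normalised embeddings, scaled roughly like CLIP's logit scale
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_embeds @ text_embeds.T).softmax(dim=-1)[0]
    return dict(zip(class_names, probs.tolist()))

print(zero_shot_classify(Image.open("photo.jpg").convert("RGB"), ["cat", "dog", "bicycle"]))
```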
## Visual Question Answering (VQA)
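CLIP alone cannot generate answers, so VQA is usually handled by a generative vision-language model such as BLIP. A sketch using a pretrained BLIP VQA checkpoint from `transformers` (the checkpoint name and image path are illustrative):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("photo.jpg").convert("RGB")  # placeholder path
question = "How many dogs are in the picture?"

inputs = processor(images=image, text=question, return_tensors="pt")
out = model.generate(**inputs)  # generates the answer tokens
print(processor.decode(out[0], skip_special_tokens=True))
```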
## Image Captioning
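Captioning follows the same pattern with a BLIP captioning checkpoint; again, the checkpoint name and image path are assumptions:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("photo.jpg").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```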
## LLaVA: Large Language-and-Vision Assistant
LLaVA connects a pretrained vision encoder (a CLIP ViT) to an LLM (LLaMA/Vicuna) through a lightweight projection layer, so that projected image features can be fed to the language model alongside its text tokens.

## Building a Multimodal Model
### Vision-Language Projector
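The bridge between modalities is a small projection module that maps vision-encoder patch features into the LLM's embedding space (a single linear layer in LLaVA v1, a two-layer MLP in LLaVA-1.5). A minimal sketch, with dimensions chosen to mirror CLIP ViT-L/14 patch features (1024-d, 576 patches) and a 7B LLaMA-class model (4096-d); both are assumptions for illustration:

```python
import torch
import torch.nn as nn

class VisionLanguageProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM embedding space."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features):
        # image_features: (batch, num_patches, vision_dim) from a frozen vision encoder
        return self.proj(image_features)  # (batch, num_patches, llm_dim)

# Projected image "tokens" are concatenated with the text token embeddings,
# and the combined sequence is fed to the LLM.
projector = VisionLanguageProjector()
image_features = torch.randn(1, 576, 1024)  # dummy CLIP ViT-L/14 patch features
text_embeds = torch.randn(1, 32, 4096)      # dummy LLM token embeddings
llm_inputs = torch.cat([projector(image_features), text_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([1, 608, 4096])
```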
### Image-Text Retrieval
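With CLIP, text-to-image retrieval reduces to embedding the image corpus once, embedding the query text, and ranking by cosine similarity. A sketch with the `transformers` CLIP API (the image paths and query are placeholders):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]  # placeholder corpus
images = [Image.open(p).convert("RGB") for p in image_paths]

with torch.no_grad():
    # Index the corpus: one L2-normalised embedding per image
    image_inputs = processor(images=images, return_tensors="pt")
    image_embeds = model.get_image_features(**image_inputs)
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)

    # Embed the query text the same way
    query = "a dog playing in the snow"
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    text_embeds = model.get_text_features(**text_inputs)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

# Rank images by cosine similarity to the query
scores = (text_embeds @ image_embeds.T).squeeze(0)
top = scores.topk(k=min(3, len(image_paths)))
for score, idx in zip(top.values, top.indices):
    print(f"{score.item():.3f}  {image_paths[idx.item()]}")
```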
## Multimodal Model Comparison
| Model | Vision Encoder | Language Model | Training Data | Typical Use |
|---|---|---|---|---|
| CLIP | ViT/ResNet | Transformer | 400M image-text pairs | Retrieval, zero-shot classification |
| BLIP | ViT | BERT | 129M image-text pairs | Captioning, VQA |
| LLaVA | CLIP ViT | LLaMA/Vicuna | ~600K image-text pairs | Visual conversation |
| GPT-4V | Proprietary | GPT-4 | Unknown | General-purpose assistant |
| Flamingo | NFNet | Chinchilla | Interleaved image-text | Few-shot learning |
## Exercises
### Exercise 1: Zero-Shot Classifier
Build a zero-shot image classifier using CLIP for a custom set of classes.
### Exercise 2: Image Search Engine
Create a text-to-image search system using CLIP embeddings.
### Exercise 3: Custom VQA
Fine-tune BLIP on a custom VQA dataset and evaluate performance.