# Foundation Models & LLMs

## The Foundation Model Paradigm

Foundation models are large models trained on broad data that can be adapted to many downstream tasks. Key characteristics:

- Scale (billions of parameters)
- Self-supervised pretraining
- Emergent capabilities
- Transfer to diverse tasks
## Scaling Laws

### The Chinchilla Scaling Law

For compute-optimal training, model size and dataset size should be scaled in equal proportion with compute:

$$N_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \propto C^{0.5}, \qquad C \approx 6ND$$

Where:

- $N$ = number of parameters
- $D$ = dataset size (tokens)
- $C$ = compute budget (FLOPs)

In practice this works out to roughly 20 training tokens per parameter.
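The allocation rule can be sketched in a few lines of pure Python. The `6ND` approximation and the ~20 tokens/parameter ratio are from the Chinchilla paper; the function name is my own.

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget C into parameters N and tokens D using
    C ~= 6*N*D together with the Chinchilla rule of thumb D ~= 20*N."""
    # C = 6*N*(20*N)  =>  N = sqrt(C / 120)
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla itself used about 5.88e23 FLOPs; this recovers
# roughly 70B parameters and 1.4T tokens, matching the table below.
n, d = chinchilla_optimal(5.88e23)
```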
| Model | Parameters | Training Tokens | Tokens/Param |
|---|---|---|---|
| GPT-3 | 175B | 300B | 1.7 |
| Chinchilla | 70B | 1.4T | 20 |
| LLaMA 2 | 70B | 2T | 29 |
| Mistral | 7B | Unknown | - |
## LLM Architecture

### Modern Transformer Improvements

#### Rotary Position Embeddings (RoPE)
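RoPE encodes position by rotating each consecutive pair of query/key channels by a position-dependent angle, so attention scores depend only on relative positions. A minimal numpy sketch (not an optimized implementation):

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embeddings to x of shape (seq_len, dim).

    Channel pair (2i, 2i+1) at position m is rotated by angle m * theta_i,
    with geometrically spaced frequencies theta_i = base^(-2i/dim)."""
    seq_len, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) * 2.0 / dim)    # (half,)
    angles = np.outer(np.arange(seq_len), inv_freq)      # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                      # split channel pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                   # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is rotated, norms are preserved, and the dot product between a rotated query at position $m$ and a rotated key at position $n$ depends only on $n - m$.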
## Pretraining Objectives
### Causal Language Modeling (GPT-style)

The model predicts each token from the tokens before it, maximizing $\sum_t \log p(x_t \mid x_{<t})$.
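In code, the causal objective is just next-token cross-entropy with the targets shifted one position. A numpy sketch:

```python
import numpy as np

def causal_lm_loss(logits: np.ndarray, tokens: np.ndarray) -> float:
    """Next-token cross-entropy: position t's logits predict token t+1.

    logits: (seq_len, vocab) model outputs; tokens: (seq_len,) int ids."""
    logits, targets = logits[:-1], tokens[1:]                 # shift by one
    logits = logits - logits.max(axis=-1, keepdims=True)      # stable softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())
```

A useful sanity check: with uniform logits the loss equals $\log(\text{vocab size})$, which is also the loss an untrained model should start near.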
### Masked Language Modeling (BERT-style)

A random subset of tokens is masked, and the model predicts the masked tokens from bidirectional context.
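The corruption step can be sketched as follows. `MASK_ID` is a hypothetical mask-token id, and this simplified version skips BERT's 80/10/10 refinement (where some targets keep their original token or get a random one):

```python
import numpy as np

MASK_ID = 103  # hypothetical [MASK] token id

def mlm_mask(tokens: np.ndarray, mask_prob: float = 0.15, seed: int = 0):
    """BERT-style corruption: pick ~15% of positions as prediction targets
    and replace them with [MASK] (80/10/10 variants omitted for clarity)."""
    rng = np.random.default_rng(seed)
    is_target = rng.random(tokens.shape) < mask_prob
    corrupted = np.where(is_target, MASK_ID, tokens)
    # labels keep the real token at target positions; -100 means "ignore"
    labels = np.where(is_target, tokens, -100)
    return corrupted, labels
```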
## Emergent Capabilities

As models scale, new abilities emerge:

| Scale | Emergent Capability |
|---|---|
| ~1B | Basic language understanding |
| ~10B | Few-shot learning |
| ~100B | Complex reasoning, code generation |
| ~500B+ | Multi-step reasoning, tool use |
## Training LLMs
### Distributed Training Setup
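The simplest distributed scheme is data parallelism: each worker computes gradients on its own batch shard, then an all-reduce averages gradients so every replica takes the same step. Real training uses NCCL collectives via tools like `torchrun` (often combined with tensor/pipeline parallelism or FSDP); the toy numpy simulation below only illustrates the math, with a made-up least-squares loss:

```python
import numpy as np

def shard_grad(w: np.ndarray, xs: np.ndarray) -> np.ndarray:
    """Per-worker gradient of the toy loss mean_i ||w - x_i||^2 on one shard."""
    return (2 * (w - xs)).mean(axis=0)

def all_reduce_mean(grads):
    """Stand-in for the collective op (e.g. NCCL all-reduce) that averages
    gradients across workers so every replica ends up with the same update."""
    return np.mean(grads, axis=0)

def data_parallel_step(w, data, n_workers=4, lr=0.1):
    shards = np.array_split(data, n_workers)       # each worker sees one slice
    grads = [shard_grad(w, s) for s in shards]     # computed in parallel in reality
    return w - lr * all_reduce_mean(grads)
```

With equal-sized shards, the averaged gradient is identical to the full-batch gradient, which is why data parallelism changes throughput but not (modulo batch-size effects) the optimization trajectory.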
### Instruction Tuning
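Instruction tuning is supervised fine-tuning on (instruction, response) pairs rendered through a prompt template, usually with the loss computed only on response tokens. The template below is purely illustrative; every model family defines its own chat format:

```python
def format_example(instruction: str, response: str) -> str:
    """Render one (instruction, response) pair with a simple Alpaca-style
    template. The section tags are illustrative, not a standard."""
    return (
        "### Instruction:\n" + instruction.strip() + "\n\n"
        "### Response:\n" + response.strip()
    )
```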
### RLHF (Reinforcement Learning from Human Feedback)

#### PPO Training
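In RLHF, the policy is trained to maximize a reward-model score, typically with a KL penalty toward the reference (SFT) model; the policy update itself is PPO's clipped surrogate objective. A numpy sketch of just that surrogate (the KL term and value loss are omitted):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate: cap how far the probability ratio can move
    the objective in a single update, per sampled action."""
    ratio = np.exp(logp_new - logp_old)                      # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # pessimistic minimum removes the incentive for overly large steps
    return float(np.minimum(unclipped, clipped).mean())
```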
## Using Foundation Models

### Model Comparison
| Model | Size | Open | Strengths |
|---|---|---|---|
| GPT-4 | ~1T? | No | Multimodal, reasoning |
| Claude 3 | Unknown | No | Safety, long context |
| LLaMA 3 | 8B-70B | Yes | Open, efficient |
| Mistral | 7B | Yes | Quality/size ratio |
| Gemma | 2B-7B | Yes | Small, efficient |
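Whichever model you choose, autoregressive inference reduces to the same decode loop: feed the running sequence back in, pick the next token, stop at EOS or a length budget. A toy greedy decoder over a stand-in `next_token_logits` function (both names hypothetical; real use would call a model's forward pass):

```python
import numpy as np

def greedy_decode(next_token_logits, prompt, max_new, eos_id=0):
    """Generic greedy decoding: append the argmax token until EOS or budget."""
    seq = list(prompt)
    for _ in range(max_new):
        tok = int(np.argmax(next_token_logits(seq)))
        seq.append(tok)
        if tok == eos_id:
            break
    return seq

# hypothetical stand-in model over a 10-token vocab:
# always prefers (last token + 1) mod 10, with id 0 acting as EOS
def toy_model(seq):
    logits = np.zeros(10)
    logits[(seq[-1] + 1) % 10] = 1.0
    return logits
```

Sampling strategies (temperature, top-p) differ only in how the next token is chosen inside the loop.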
## Exercises
### Exercise 1: Scaling Analysis
Plot loss vs compute for different model sizes. Verify the Chinchilla scaling law.
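One possible starting point: the Chinchilla parametric loss fit $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$, with the constants fitted by Hoffmann et al. (2022) ($E \approx 1.69$, $A \approx 406.4$, $B \approx 410.7$, $\alpha \approx 0.34$, $\beta \approx 0.28$), gives you synthetic loss curves to plot against compute:

```python
import numpy as np

def chinchilla_loss(n_params, n_tokens,
                    E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Parametric loss fit from Hoffmann et al. (2022)."""
    return E + A / n_params**alpha + B / n_tokens**beta

def loss_at_compute(c_flops, n_params):
    """Loss when a FLOP budget C is spent on a model of size N, so D = C/(6N)."""
    return chinchilla_loss(n_params, c_flops / (6 * n_params))
```

Sweep `n_params` at several fixed compute budgets and plot `loss_at_compute`; the minima should trace out the compute-optimal frontier.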
### Exercise 2: Build a Mini-LLM
Train a small (10M parameter) causal language model on a text corpus.
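To pick a configuration, a rough decoder-only parameter count helps; the `12Ld^2` approximation (4d² for attention, 8d² for a 4x-wide MLP, per layer) plus tied embeddings is standard, while the example config values are my suggestion:

```python
def transformer_params(n_layers: int, d_model: int, vocab: int) -> int:
    """Rough decoder-only parameter count: ~12 * L * d^2 per-layer weights
    plus vocab * d for tied input/output embeddings (biases, norms ignored)."""
    return 12 * n_layers * d_model**2 + vocab * d_model

# e.g. 8 layers, d_model=288, 8k vocab lands near the 10M target (assumed config)
p = transformer_params(8, 288, 8000)
```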
### Exercise 3: Instruction Tuning
Fine-tune a small LLM on instruction data using LoRA. Compare before/after.
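The core of LoRA fits in a few lines: freeze the base weight $W$ and learn a low-rank update $\frac{\alpha}{r} AB$. A numpy sketch of the forward pass (in practice you would use a library such as PEFT):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """LoRA forward: y = x @ (W + (alpha/r) * A @ B), computed without
    materializing the merged matrix. Only A (d_in x r) and B (r x d_out)
    are trained; W stays frozen. B starts at zero so training begins
    exactly at the base model."""
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A) @ B
```

After training, the update can be merged into $W$ once, so inference costs nothing extra.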