Foundation Models & LLMs
The Foundation Model Paradigm
Scaling Laws
The Chinchilla Scaling Law
LLM Architecture
Modern Transformer Improvements
Rotary Position Embeddings (RoPE)
Pretraining Objectives
Causal Language Modeling (GPT-style)
Masked Language Modeling (BERT-style)
Emergent Capabilities
Training LLMs
Distributed Training Setup
Instruction Tuning
RLHF (Reinforcement Learning from Human Feedback)
PPO Training
Using Foundation Models
Model Comparison
Exercises
What’s Next

Foundation Models & LLMs

The Foundation Model Paradigm

Foundation models are large models trained on broad data that can be adapted to many downstream tasks. Key characteristics:

Scale (billions of parameters)
Self-supervised pretraining
Emergent capabilities
Transfer to diverse tasks

Scaling Laws

The Chinchilla Scaling Law

For compute-optimal training:

$$N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}$$

Where:

$N$ = number of parameters
$D$ = dataset size (tokens)
$C$ = compute budget (FLOPs)

Rule of thumb: Train on ~20 tokens per parameter.

Model	Parameters	Training Tokens	Tokens per Parameter
GPT-3	175B	300B	1.7
Chinchilla	70B	1.4T	20
LLaMA 2	70B	2T	29
Mistral 7B	7B	Undisclosed	-
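
To make the rule of thumb concrete, the sketch below solves for a compute-optimal model size given a FLOP budget. It assumes the widely used approximation C ≈ 6ND for training FLOPs, which is not stated above, together with the ~20 tokens-per-parameter ratio. `compute_optimal` is a hypothetical helper name.

```python
import math

def compute_optimal(flops, tokens_per_param=20.0):
    """Return (parameters, tokens) for a FLOP budget.

    Assumes C ~= 6 * N * D and D = r * N, so N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget implied by Chinchilla's own configuration (70B parameters, 1.4T tokens)
chinchilla_flops = 6 * 70e9 * 1.4e12
n, d = compute_optimal(chinchilla_flops)
print(f"{n / 1e9:.0f}B parameters, {d / 1e12:.1f}T tokens")

# Quadrupling compute doubles both N and D, matching the C^0.5 exponents
n4, d4 = compute_optimal(4 * chinchilla_flops)
print(n4 / n, d4 / d)
```

Under these assumptions the Chinchilla budget maps back to roughly 70B parameters and 1.4T tokens, which is a useful sanity check on the table above.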

LLM Architecture

Modern Transformer Improvements

import torch
import torch.nn as nn
import torch.nn.functional as F

class ModernTransformerBlock(nn.Module):
    """LLaMA-style transformer block with modern improvements."""
    
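    def __init__(self, dim, num_heads, mlp_ratio=4, dropout=0.0):
        super().__init__()
        # NOTE: `dropout` is accepted but not used anywhere in this block.
        
        # Pre-normalization with RMSNorm
        self.norm1 = RMSNorm(dim)
        self.norm2 = RMSNorm(dim)
        
        # Grouped Query Attention: four query heads share each key/value head.
        # GroupedQueryAttention must be defined before this block is instantiated;
        # it is expected to take (x, freqs_cis) and return a tensor shaped like x.
        self.attn = GroupedQueryAttention(dim, num_heads, num_kv_heads=max(1, num_heads // 4))
        
        # SwiGLU MLP; the 2/3 factor keeps the parameter count close to a
        # standard two-matrix MLP despite the extra gate projection (w3)
        self.mlp = SwiGLU(dim, int(dim * mlp_ratio * 2/3))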
    
    def forward(self, x, freqs_cis=None):
        # Pre-norm + residual
        x = x + self.attn(self.norm1(x), freqs_cis)
        x = x + self.mlp(self.norm2(x))
        return x


class RMSNorm(nn.Module):
    """Root Mean Square Normalization."""
    
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))
    
    def forward(self, x):
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight


class SwiGLU(nn.Module):
    """SwiGLU feed-forward layer: a SiLU-gated linear unit used in LLaMA-style models in place of a ReLU/GELU MLP."""
    
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
    
    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
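
The block above refers to a `GroupedQueryAttention` module without defining it. The following is a minimal sketch of one way it could look, not the definitive LLaMA implementation. It assumes PyTorch 2.0 or later for `F.scaled_dot_product_attention`, uses a causal mask, and accepts `freqs_cis` only to match the call signature above; applying rotary embeddings is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Causal self-attention where several query heads share one K/V head."""

    def __init__(self, dim, num_heads, num_kv_heads):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        assert num_heads % num_kv_heads == 0, "query heads must split evenly over KV heads"
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.head_dim = dim // num_heads
        self.wq = nn.Linear(dim, num_heads * self.head_dim, bias=False)
        # K and V projections are smaller: one set per KV head, not per query head
        self.wk = nn.Linear(dim, num_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, num_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(num_heads * self.head_dim, dim, bias=False)

    def forward(self, x, freqs_cis=None):
        # freqs_cis is ignored in this sketch (no rotary embedding applied)
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.num_heads, self.head_dim)
        k = self.wk(x).view(b, t, self.num_kv_heads, self.head_dim)
        v = self.wv(x).view(b, t, self.num_kv_heads, self.head_dim)

        # Repeat each KV head so every query head has a matching K/V
        group = self.num_heads // self.num_kv_heads
        k = k.repeat_interleave(group, dim=2)
        v = v.repeat_interleave(group, dim=2)

        # (batch, seq, heads, head_dim) -> (batch, heads, seq, head_dim)
        q, k, v = (z.transpose(1, 2) for z in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.wo(out)

attn = GroupedQueryAttention(dim=64, num_heads=8, num_kv_heads=2)
x = torch.randn(2, 5, 64)
y = attn(x)
print(y.shape, attn.wk.weight.shape)
```

With 8 query heads and 2 key/value heads, the K and V projections are a quarter the size of the Q projection, which is where the savings in KV-cache memory come from.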

Rotary Position Embeddings (RoPE)

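def precompute_freqs_cis(dim, max_seq_len, base=10000):
    """Precompute rotary embedding frequencies.

    `dim` is the per-head dimension (must be even). Returns a complex tensor
    of shape (max_seq_len, dim // 2).
    """
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(max_seq_len, dtype=torch.float32)
    freqs = torch.outer(t, freqs)
    return torch.polar(torch.ones_like(freqs), freqs)  # Complex exponentials


def apply_rotary_emb(xq, xk, freqs_cis):
    """Apply rotary embeddings to queries and keys.

    Assumes xq and xk have shape (batch, seq_len, num_heads, head_dim).
    """
    # View consecutive feature pairs as complex numbers
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    
    # Reshape (seq_len, head_dim // 2) -> (1, seq_len, 1, head_dim // 2) so the
    # rotation broadcasts over the batch and head dimensions
    seq_len = xq_.shape[1]
    freqs_cis = freqs_cis[:seq_len].view(1, seq_len, 1, -1)
    
    # Apply rotation (complex multiplication), then flatten back to real pairs
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(-2)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(-2)
    
    return xq_out.type_as(xq), xk_out.type_as(xk)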
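
The reason rotating queries and keys encodes position is that their dot product ends up depending only on the distance between the two positions. The sketch below checks this on a toy example in plain Python with `cmath`; it mirrors the frequency formula above but is an illustration, not the tensor implementation. `rope_rotate` is a hypothetical helper name.

```python
import cmath

def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos / (base ** (i / dim))  # same frequencies as precompute_freqs_cis
        z = complex(vec[i], vec[i + 1]) * cmath.exp(1j * theta)
        out.extend([z.real, z.imag])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.1]

# Same offset (2) at different absolute positions
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_b = dot(rope_rotate(q, 12), rope_rotate(k, 10))
# A different offset (4) generally gives a different score
score_c = dot(rope_rotate(q, 7), rope_rotate(k, 3))
print(score_a, score_b, score_c)
```

The first two scores agree up to floating-point error because each pair contributes the real part of q·conj(k)·e^{i(m−n)θ}, which involves only the offset m − n.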

Pretraining Objectives

Causal Language Modeling (GPT-style)

$$\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t})$$

def causal_lm_loss(logits, labels):
    """Next token prediction loss."""
    # Shift so we predict next token
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100  # Padding
    )
    return loss
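
As a hand-checkable instance of the objective, the sketch below sums the per-token negative log-likelihood for a toy sequence. The probabilities are made up for illustration, and `sequence_nll` is a hypothetical helper. Note that the formula sums over positions, whereas `F.cross_entropy` averages over non-ignored positions by default.

```python
import math

def sequence_nll(step_probs):
    """Sum of -log P(x_t | x_<t) over the sequence."""
    return -sum(math.log(p) for p in step_probs)

# Hypothetical probabilities the model assigned to the correct next token
step_probs = [0.5, 0.25, 0.8]

total = sequence_nll(step_probs)  # matches the summed objective
mean = total / len(step_probs)    # matches the default mean reduction
print(total, mean)
```

Because the three probabilities multiply to 0.1, the summed loss equals −log(0.1), i.e. the negative log-probability of the whole sequence.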

Masked Language Modeling (BERT-style)

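def create_mlm_inputs(tokens, mask_token_id, mask_prob=0.15, vocab_size=32000,
                      special_tokens_mask=None):
    """Create masked inputs for MLM training.

    `mask_token_id` is the tokenizer's [MASK] id. `special_tokens_mask` is an
    optional boolean tensor marking positions (e.g. [CLS], [SEP], padding)
    that must never be selected.
    """
    tokens = tokens.clone()  # Avoid mutating the caller's tensor
    labels = tokens.clone()
    
    # Select each position for prediction with probability mask_prob
    probability_matrix = torch.full(tokens.shape, mask_prob)
    if special_tokens_mask is not None:
        # Don't mask special tokens
        probability_matrix.masked_fill_(special_tokens_mask.bool(), 0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    
    # Compute the loss only on selected positions
    labels[~masked_indices] = -100
    
    # 80% of selected positions -> [MASK]
    indices_replaced = masked_indices & (torch.rand(tokens.shape) < 0.8)
    tokens[indices_replaced] = mask_token_id
    
    # Half of the remaining 20% (10% overall) -> random token; the rest stay unchanged
    indices_random = masked_indices & ~indices_replaced & (torch.rand(tokens.shape) < 0.5)
    tokens[indices_random] = torch.randint(vocab_size, tokens.shape)[indices_random]
    
    return tokens, labels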

Emergent Capabilities

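As models scale, new abilities tend to appear. The thresholds below are rough orders of magnitude rather than sharp cutoffs; they vary with training data, training compute, and how the ability is measured:

Approx. Parameters	Emergent Capability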
~1B	Basic language understanding
~10B	Few-shot learning
~100B	Complex reasoning, code generation
~500B+	Multi-step reasoning, tool use

Training LLMs

Distributed Training Setup

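import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)

def train_llm(model, dataloader):
    """Train `model` with FSDP. Launch one process per GPU, e.g. with torchrun.

    `model` is assumed to return an object with a `.loss` attribute.
    """
    # Initialize distributed and bind this process to its GPU
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # set by torchrun
    
    # Wrap with FSDP: shards parameters, gradients, and optimizer state
    model = FSDP(
        model,
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.float32,
        ),
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        device_id=torch.cuda.current_device(),
    )
    
    # Create the optimizer after wrapping so it sees the sharded parameters
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=3e-4,
        betas=(0.9, 0.95),
        weight_decay=0.1,
    )
    
    # Training loop
    for step, batch in enumerate(dataloader):
        loss = model(batch).loss
        loss.backward()
        
        # Gradient clipping; FSDP's own method computes the norm across shards
        model.clip_grad_norm_(1.0)
        
        optimizer.step()
        optimizer.zero_grad()
        
        if rank == 0 and step % 100 == 0:
            print(f"Step {step}: loss = {loss.item():.4f}")
    
    dist.destroy_process_group()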

Instruction Tuning

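from datasets import load_dataset

INSTRUCTION_TEMPLATE = """<|system|>
{system_message}
<|user|>
{instruction}
<|assistant|>
{response}"""

def format_instruction_data(example):
    # Dataset.map expects a dict of new or updated columns
    text = INSTRUCTION_TEMPLATE.format(
        system_message="You are a helpful assistant.",
        instruction=example["instruction"],
        response=example["response"],
    )
    return {"text": text}

# Fine-tune on instruction dataset
# "instruction_data" is a placeholder name; the dataset is assumed to have
# "instruction" and "response" columns
instruction_dataset = load_dataset("instruction_data")
formatted = instruction_dataset.map(format_instruction_data)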
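
A common companion to this formatting step, not shown above, is to compute the loss only on the response tokens so the model is not trained to reproduce the prompt. The sketch below illustrates the idea with toy token ids, reusing the `-100` ignore index from the causal LM loss. `build_labels` and the ids are hypothetical; a real pipeline would take the ids from the tokenizer.

```python
IGNORE_INDEX = -100  # same value as ignore_index in the cross-entropy loss

def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt out of the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Toy ids standing in for tokenized "<|system|>...<|assistant|>" and the response
prompt_ids = [1, 15, 27, 9]
response_ids = [42, 43, 2]

input_ids, labels = build_labels(prompt_ids, response_ids)
print(input_ids)
print(labels)
```

The labels keep the same length as the inputs, so the usual one-position shift in the loss function still lines up.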

RLHF (Reinforcement Learning from Human Feedback)

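class RewardModel(nn.Module):
    """Reward model trained on human preferences."""
    
    def __init__(self, base_model):
        super().__init__()
        self.base = base_model
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)
    
    def forward(self, input_ids, attention_mask):
        outputs = self.base(input_ids, attention_mask=attention_mask)
        hidden = outputs.last_hidden_state  # (batch, seq_len, hidden)
        # Use the last non-padding token of each sequence (assumes right padding);
        # with padded batches, position -1 may be a pad token
        last_idx = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        last_hidden = hidden[batch_idx, last_idx]
        return self.reward_head(last_hidden).squeeze(-1)  # (batch,)


def compute_preference_loss(reward_model, chosen, rejected):
    """Bradley-Terry preference model loss.

    `chosen` and `rejected` are dicts with `input_ids` and `attention_mask`.
    """
    reward_chosen = reward_model(**chosen)
    reward_rejected = reward_model(**rejected)
    
    # Maximize the log-probability that the chosen response outranks the rejected one
    loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
    return loss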
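
To see how the preference loss behaves, the sketch below evaluates it for single pairs of scalar rewards in plain Python. `preference_loss` is a hypothetical helper that mirrors `-F.logsigmoid(chosen - rejected)` for one pair.

```python
import math

def preference_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected) for a single preference pair."""
    margin = r_chosen - r_rejected
    # log(1 + exp(-margin)) is algebraically equal to -log(sigmoid(margin))
    return math.log1p(math.exp(-margin))

print(preference_loss(1.0, 1.0))  # no preference: log(2)
print(preference_loss(3.0, 0.0))  # chosen scored higher: small loss
print(preference_loss(0.0, 3.0))  # ranking inverted: large loss
```

Only the reward difference matters, so the reward model's absolute scale is not pinned down by this loss.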

PPO Training

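def ppo_step(policy, ref_policy, reward_model, prompts, kl_coef=0.1, eps=0.2):
    """Single PPO update step (simplified).

    `generate` and `log_prob` are assumed helper methods on the policy objects.
    The KL-penalized reward is used directly as the advantage; full PPO
    subtracts a learned value baseline.
    """
    with torch.no_grad():
        # Generate responses and record log-probs under the sampling-time policy
        responses = policy.generate(prompts)
        old_logprobs = policy.log_prob(prompts, responses)
        ref_logprobs = ref_policy.log_prob(prompts, responses)
        
        # Compute rewards
        rewards = reward_model(prompts, responses)
        
        # Total reward = reward - KL penalty (treated as a constant)
        kl = old_logprobs - ref_logprobs
        total_reward = rewards - kl_coef * kl
    
    # Log-probs under the current policy, with gradients
    policy_logprobs = policy.log_prob(prompts, responses)
    
    # PPO clipped objective; the ratio is 1 on the first update after sampling
    ratio = torch.exp(policy_logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    loss = -torch.min(ratio * total_reward, clipped * total_reward).mean()
    
    return loss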
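
The effect of the clipping is easiest to see on scalars. The sketch below evaluates the clipped surrogate for one sample in plain Python; `clipped_surrogate` is a hypothetical helper, and `eps=0.2` is an assumed clip range rather than a value given above.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) for one sample."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: gains from pushing the ratio above 1 + eps are capped
up_good = clipped_surrogate(1.5, 1.0)
# Negative advantage: the unclipped (more pessimistic) term is kept
up_bad = clipped_surrogate(1.5, -1.0)
# Ratio below 1 - eps with negative advantage: capped at (1 - eps) * A
down_bad = clipped_surrogate(0.5, -1.0)
print(up_good, up_bad, down_bad)
```

Taking the minimum makes the objective a pessimistic bound: the policy gets no extra credit for moving far from the sampling policy, but is still penalized when such a move hurts.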

Using Foundation Models

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load pretrained LLM
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def generate(prompt, max_tokens=100):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
    )
    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
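
The `top_p=0.9` argument restricts sampling to the smallest set of tokens whose cumulative probability reaches 0.9. The sketch below is a simplified picture of that filtering step on a toy distribution in plain Python; the library's own implementation works on logits and differs in detail, and `top_p_filter` is a hypothetical helper.

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break  # the token that crosses the threshold is included
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Toy next-token distribution over a four-token vocabulary
probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_p_filter(probs, top_p=0.9)
print(filtered)
```

Here the least likely token is dropped and the remaining three are rescaled to sum to 1 before sampling.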

Model Comparison

Model	Size	Open Weights	Strengths
GPT-4	Undisclosed	No	Multimodal, reasoning
Claude 3	Undisclosed	No	Safety, long context
LLaMA 3	8B-70B	Yes	Open, efficient
Mistral 7B	7B	Yes	Quality/size ratio
Gemma	2B-7B	Yes	Small, efficient

Exercises

Exercise 1: Scaling Analysis

Plot loss vs compute for different model sizes. Verify the Chinchilla scaling law.

Exercise 2: Build a Mini-LLM

Train a small (10M parameter) causal language model on a text corpus.

Exercise 3: Instruction Tuning

Fine-tune a small LLM on instruction data using LoRA. Compare before/after.

What’s Next

Module 26: Capstone Project

Build a complete deep learning system from scratch.
