Foundation Models & LLMs

The Foundation Model Paradigm

Foundation models are large models trained on broad data that can be adapted to many downstream tasks. The analogy is a well-rounded liberal arts education: rather than training a specialist from scratch for every task, you invest heavily in a broad base of knowledge, then specialize with minimal additional effort. GPT-3 was trained once at enormous cost, but it can be adapted to summarization, translation, code generation, and many other tasks with just a few examples, or even zero examples.

The term “foundation model” was coined in 2021 by Stanford’s Center for Research on Foundation Models (CRFM), part of HAI, and it captures a paradigm shift: instead of building task-specific models from scratch, you build on top of a pre-existing “foundation” of learned knowledge.

Key characteristics:
  • Scale: billions of parameters, trained on hundreds of billions to trillions of tokens
  • Self-supervised pretraining: learns from raw text without human annotation
  • Emergent capabilities: abilities that appear only at sufficient scale, not explicitly trained
  • Transfer to diverse tasks: one model, many applications via prompting or fine-tuning

Scaling Laws

The Chinchilla Scaling Law

DeepMind’s Chinchilla paper (2022) fundamentally changed how the industry thinks about training LLMs. The key finding: most models before Chinchilla were dramatically undertrained: they were too large for the amount of data they saw. For compute-optimal training:

$$N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}$$

Where:
  • $N$ = number of parameters
  • $D$ = dataset size (tokens)
  • $C$ = compute budget (FLOPs)
Rule of thumb: Train on ~20 tokens per parameter. This means a 7B parameter model should see roughly 140B tokens for compute-optimal training.
Why this matters in practice: Before Chinchilla, the trend was “bigger model = better.” Chinchilla showed that a 70B model trained on 1.4T tokens outperformed the 280B Gopher model trained on 300B tokens — at much lower inference cost. This shifted the industry toward smaller, better-trained models, and directly influenced LLaMA, Mistral, and other efficient model families.
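| Model | Parameters | Training Tokens | Ratio (tokens per parameter) |
|---|---|---|---|
| GPT-3 | 175B | 300B | 1.7 |
| Chinchilla | 70B | 1.4T | 20 |
| LLaMA 2 | 70B | 2T | 29 |
| Mistral | 7B | Unknown | - |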
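
As a rough illustration of the rule of thumb above, the sketch below turns a compute budget into a parameter and token count. It assumes the common approximation C ≈ 6·N·D for training FLOPs, which is not stated in the text above, together with the ~20 tokens-per-parameter ratio. The function name `chinchilla_optimal` is made up for this example.

```python
import math


def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Estimate a compute-optimal (params, tokens) split.

    Assumes training cost C ~= 6 * N * D FLOPs (a common approximation)
    and D = tokens_per_param * N, so N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Sanity check against Chinchilla itself: 70B parameters on 1.4T tokens
budget = 6 * 70e9 * 1.4e12
n, d = chinchilla_optimal(budget)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.1f}T")
```

Because both N and D scale as the square root of C under these assumptions, quadrupling the compute budget doubles both the suggested model size and the suggested token count.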

LLM Architecture

Modern Transformer Improvements

Modern LLMs use several key improvements over the original 2017 Transformer. Each one addresses a specific limitation discovered through years of scaling experiments. Understanding these is essential for reading current research papers and building production systems.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModernTransformerBlock(nn.Module):
    """LLaMA-style transformer block with modern improvements.
    
    Three key changes from the original Transformer:
    1. Pre-normalization (RMSNorm before attention/MLP, not after)
    2. Grouped Query Attention (fewer KV heads = faster inference)
    3. SwiGLU activation (empirically better than ReLU/GELU)
    """
    
    def __init__(self, dim, num_heads, mlp_ratio=4, dropout=0.0):
        super().__init__()
        
        # Pre-normalization with RMSNorm (simpler and faster than LayerNorm)
        self.norm1 = RMSNorm(dim)
        self.norm2 = RMSNorm(dim)
        
        # Grouped Query Attention: use 4x fewer KV heads than query heads,
        # which shrinks the KV cache by about 4x during inference.
        # GroupedQueryAttention is assumed to take (dim, num_heads, num_kv_heads)
        # and to accept freqs_cis in forward(). max(1, ...) guards num_heads < 4.
        self.attn = GroupedQueryAttention(dim, num_heads, num_kv_heads=max(1, num_heads // 4))
        
        # SwiGLU MLP: reported to give a small but consistent gain over GELU
        # The 2/3 factor compensates for SwiGLU having 3 weight matrices instead of 2
        self.mlp = SwiGLU(dim, int(dim * mlp_ratio * 2/3))
    
    def forward(self, x, freqs_cis=None):
        # Pre-norm + residual
        x = x + self.attn(self.norm1(x), freqs_cis)
        x = x + self.mlp(self.norm2(x))
        return x


class RMSNorm(nn.Module):
    """Root Mean Square Normalization.
    
    Simpler than LayerNorm: skips the mean-centering step.
    Empirically equivalent in quality but ~10% faster because
    it avoids computing the mean across the feature dimension.
    Used in LLaMA, Mistral, and most modern LLMs.
    """
    
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))
    
    def forward(self, x):
        # Compute root mean square (no mean subtraction, unlike LayerNorm)
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight


class SwiGLU(nn.Module):
    """SwiGLU activation (better than ReLU/GELU for LLMs).
    
    SwiGLU uses a gating mechanism: one linear projection creates a "gate"
    that controls how much information flows through another projection.
    Think of it as the network learning to selectively amplify or suppress
    different features at each position.
    
    Formula: SwiGLU(x) = (Swish(xW1)) * (xW3) then projected by W2
    """
    
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # Gate projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)   # Down projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)   # Up projection
    
    def forward(self, x):
        # F.silu = x * sigmoid(x), also called "Swish"
        # The gate (silu(w1(x))) controls the flow of the value (w3(x))
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
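
The block above refers to a `GroupedQueryAttention` module without defining it. One possible minimal version is sketched here; it is not taken from any particular codebase. It assumes PyTorch 2.0 or later (for `F.scaled_dot_product_attention`), causal attention, no KV cache, and a `freqs_cis` tensor of shape `(seq_len, head_dim // 2)`. The helper `_rotate` is a name invented for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def _rotate(x, freqs_cis):
    """Rotate feature pairs of x (batch, heads, seq, head_dim) by complex phases."""
    x_ = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    return torch.view_as_real(x_ * freqs_cis).flatten(-2).type_as(x)


class GroupedQueryAttention(nn.Module):
    """Causal self-attention where groups of query heads share one KV head."""

    def __init__(self, dim, num_heads, num_kv_heads):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        assert num_heads % num_kv_heads == 0, "num_heads must be a multiple of num_kv_heads"
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.head_dim = dim // num_heads
        self.wq = nn.Linear(dim, num_heads * self.head_dim, bias=False)
        # K and V projections are smaller: one per KV head, not per query head
        self.wk = nn.Linear(dim, num_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, num_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(num_heads * self.head_dim, dim, bias=False)

    def forward(self, x, freqs_cis=None):
        bsz, seq_len, _ = x.shape
        # Project and reshape to (batch, heads, seq, head_dim)
        q = self.wq(x).view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(bsz, seq_len, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(bsz, seq_len, self.num_kv_heads, self.head_dim).transpose(1, 2)

        # Optional rotary position embeddings on queries and keys
        if freqs_cis is not None:
            q = _rotate(q, freqs_cis[:seq_len])
            k = _rotate(k, freqs_cis[:seq_len])

        # Repeat each KV head so it lines up with its group of query heads
        group = self.num_heads // self.num_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)

        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(bsz, seq_len, -1)
        return self.wo(out)
```

In this sketch the memory saving comes from `wk` and `wv` producing only `num_kv_heads` heads. A real inference path would cache those smaller K/V tensors and expand them only at attention time.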

Rotary Position Embeddings (RoPE)

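def precompute_freqs_cis(dim, max_seq_len, base=10000):
    """Precompute rotary embedding frequencies.

    dim is the per-head dimension (must be even). Returns a complex tensor
    of shape (max_seq_len, dim // 2).
    """
    # One frequency per 2D subspace: theta_i = base^(-2i / dim)
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(max_seq_len).float()  # Positions 0..max_seq_len-1
    freqs = torch.outer(t, freqs)          # Angle for each (position, frequency) pair
    return torch.polar(torch.ones_like(freqs), freqs)  # Complex exponentials


def apply_rotary_emb(xq, xk, freqs_cis):
    """Apply rotary embeddings to queries and keys.

    Assumes xq and xk have shape (..., seq_len, head_dim) and freqs_cis has
    shape (seq_len, head_dim // 2), so the two broadcast directly.
    """
    # Pair up adjacent features and view each pair as one complex number
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    
    # Complex multiplication rotates each pair by its position-dependent angle
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(-2)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(-2)
    
    return xq_out.type_as(xq), xk_out.type_as(xk)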
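
The property that makes rotary embeddings attractive is that the query-key score depends only on the relative offset between positions. The toy check below illustrates this for a single 2D feature pair using only the standard library. The values of `q`, `k`, and `theta` are arbitrary, and `rotate` and `score` are names invented for this sketch.

```python
import cmath


def rotate(z, position, theta):
    """Rotate one complex feature pair by the angle position * theta."""
    return z * cmath.exp(1j * position * theta)


def score(q, k, m, n, theta):
    """Dot product of the rotated 2D vectors, i.e. Re(q_m * conj(k_n))."""
    return (rotate(q, m, theta) * rotate(k, n, theta).conjugate()).real


q, k, theta = complex(0.3, -1.2), complex(0.8, 0.5), 0.01

# Same offset (m - n = 3) at very different absolute positions
near = score(q, k, 5, 2, theta)
far = score(q, k, 1005, 1002, theta)
print(abs(near - far) < 1e-9)
```

The two scores agree up to floating-point error because the rotations combine into a single factor of `exp(1j * (m - n) * theta)`, so the absolute positions cancel out.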

Pretraining Objectives

Causal Language Modeling (GPT-style)

$$\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t})$$
def causal_lm_loss(logits, labels):
    """Next token prediction loss.
    
    The fundamental training objective: at every position, predict the NEXT token.
    If the sequence is [The, cat, sat], we want:
      - Position 0 (The) -> predict "cat"
      - Position 1 (cat) -> predict "sat"
    This is why we shift logits and labels by one position.
    """
    # Shift: logits at position t predict the token at position t+1
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),  # Flatten to (batch*seq, vocab)
        shift_labels.view(-1),                          # Flatten to (batch*seq,)
        ignore_index=-100  # Ignore padding tokens in loss computation
    )
    return loss
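
For intuition about the scale of this loss, here is a small worked example in plain Python. The probabilities are hypothetical values a model might assign to the correct next tokens in the `[The, cat, sat]` example. The function averages over positions, which matches the default mean reduction of `F.cross_entropy`, whereas the formula above is written as a sum.

```python
import math


def next_token_nll(probs_of_correct_token):
    """Mean negative log-likelihood over positions."""
    total = -sum(math.log(p) for p in probs_of_correct_token)
    return total / len(probs_of_correct_token)


# Hypothetical: P("cat" | The) = 0.5, P("sat" | The cat) = 0.25
loss = next_token_nll([0.5, 0.25])
perplexity = math.exp(loss)
print(f"loss = {loss:.4f}, perplexity = {perplexity:.3f}")
```

Perplexity is the exponential of the mean loss. It can be read as the effective number of equally likely choices the model is hedging between at each step.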

Masked Language Modeling (BERT-style)

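def create_mlm_inputs(tokens, mask_token_id, mask_prob=0.15, vocab_size=32000):
    """Create masked inputs for MLM training.

    mask_token_id is the tokenizer's [MASK] id. Special tokens are not
    excluded here; remove them from masked_indices if your data has them.
    """
    tokens = tokens.clone()  # Avoid mutating the caller's tensor
    labels = tokens.clone()
    
    # Select ~mask_prob of positions as prediction targets
    probability_matrix = torch.full(tokens.shape, mask_prob)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    
    # Only selected positions contribute to the loss (-100 is ignored)
    labels[~masked_indices] = -100
    
    # 80% of selected positions: replace with [MASK]
    indices_replaced = masked_indices & (torch.rand(tokens.shape) < 0.8)
    tokens[indices_replaced] = mask_token_id
    
    # Half of the remaining 20% (10% overall): replace with a random token
    indices_random = masked_indices & ~indices_replaced & (torch.rand(tokens.shape) < 0.5)
    tokens[indices_random] = torch.randint(vocab_size, tokens.shape)[indices_random]
    
    # The final 10% stay unchanged but are still predicted
    return tokens, labels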

Emergent Capabilities

One of the most fascinating (and debated) phenomena in foundation models: as models scale, new abilities appear that were not present at smaller scales. This is not a gradual improvement — it is more like phase transitions in physics, where water suddenly becomes ice at a critical temperature.
| Scale | Emergent Capability |
|---|---|
| ~1B | Basic language understanding |
| ~10B | Few-shot learning |
| ~100B | Complex reasoning, code generation |
| ~500B+ | Multi-step reasoning, tool use |
Important nuance: The concept of “emergence” in LLMs is actively debated. Some researchers argue that emergence is partly an artifact of how we measure performance (e.g., using exact-match accuracy that jumps from 0% to near-100%). When using continuous metrics like log-likelihood, the improvement often looks gradual. Regardless, the practical reality is clear: larger models can do things smaller models cannot, and predicting exactly which capabilities will appear at which scale remains an open problem.
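
The measurement argument can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that each token of an answer is correct independently with the same probability. The per-token accuracies are made-up numbers rather than measurements from any model, and `exact_match_rate` is a name invented for this example.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that every token of an answer is right, assuming independent tokens."""
    return per_token_accuracy ** answer_length


# Hypothetical per-token accuracies for models of increasing scale
for p in [0.50, 0.70, 0.90, 0.95, 0.99]:
    em = exact_match_rate(p, answer_length=20)
    print(f"per-token accuracy {p:.2f} -> exact match on 20 tokens {em:.3f}")
```

Under these assumptions the per-token metric improves steadily, while exact match stays near zero for most of the range and then climbs steeply. That is the pattern the metric-artifact argument predicts.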

Training LLMs

Distributed Training Setup

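import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)

def train_llm(config, dataloader):
    """FSDP training loop sketch.

    LLM, config, and dataloader are placeholders for your own model class,
    its configuration, and a batch iterator.
    """
    # Initialize distributed (launch with torchrun, which sets LOCAL_RANK)
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    
    # Create model and wrap with FSDP
    model = LLM(config)
    model = FSDP(
        model,
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.float32,
        ),
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        device_id=torch.cuda.current_device(),
    )
    
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=3e-4,
        betas=(0.9, 0.95),
        weight_decay=0.1,
    )
    
    # Training loop
    for step, batch in enumerate(dataloader):
        loss = model(batch).loss
        loss.backward()
        
        # Gradient clipping: use FSDP's method so the norm is computed
        # across all shards, not just this rank's local parameters
        model.clip_grad_norm_(1.0)
        
        optimizer.step()
        optimizer.zero_grad()
        
        if rank == 0 and step % 100 == 0:
            print(f"Step {step}: loss = {loss.item():.4f}")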

Instruction Tuning

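from datasets import load_dataset

INSTRUCTION_TEMPLATE = """<|system|>
{system_message}
<|user|>
{instruction}
<|assistant|>
{response}"""

def format_instruction_data(example):
    # Dataset.map expects a dict; store the formatted prompt in a "text" column
    return {
        "text": INSTRUCTION_TEMPLATE.format(
            system_message="You are a helpful assistant.",
            instruction=example["instruction"],
            response=example["response"],
        )
    }

# Fine-tune on instruction dataset
# "instruction_data" is a placeholder; substitute a dataset that has
# "instruction" and "response" columns
instruction_dataset = load_dataset("instruction_data")
formatted = instruction_dataset.map(format_instruction_data)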

RLHF (Reinforcement Learning from Human Feedback)

RLHF is how raw language models become useful assistants. The base model can generate fluent text, but it might also be toxic, unhelpful, or dishonest. RLHF aligns the model with human preferences through a three-step process: (1) supervised fine-tuning on demonstrations, (2) training a reward model on human preference rankings, and (3) using RL (typically PPO) to optimize the policy against the reward model while staying close to the original model. The analogy: the base model is a talented but unsocialized intern. SFT teaches them basic professional behavior from examples. RLHF then refines their judgment by showing them pairs of responses and learning which ones humans prefer.
class RewardModel(nn.Module):
    """Reward model trained on human preferences."""
    
    def __init__(self, base_model):
        super().__init__()
        self.base = base_model
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)
    
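    def forward(self, input_ids, attention_mask):
        outputs = self.base(input_ids, attention_mask=attention_mask)
        # Index of the last real (non-padding) token, assuming right padding;
        # plain [:, -1, :] would read a padding position for shorter sequences
        last_idx = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(input_ids.size(0), device=input_ids.device)
        last_hidden = outputs.last_hidden_state[batch_idx, last_idx]
        return self.reward_head(last_hidden).squeeze(-1)  # Shape: (batch,)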


def compute_preference_loss(reward_model, chosen, rejected):
    """Bradley-Terry preference model loss."""
    reward_chosen = reward_model(**chosen)
    reward_rejected = reward_model(**rejected)
    
    loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
    return loss
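
To get a feel for the preference loss above, the snippet below evaluates it for a few scalar reward pairs in plain Python. The reward values are arbitrary, and `preference_loss` is a name chosen for this sketch. It uses the identity -log(sigmoid(m)) = log(1 + exp(-m)), written in a form that avoids overflow for large margins.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)) for a single preference pair."""
    margin = reward_chosen - reward_rejected
    # softplus(-margin), split so that exp() never sees a large positive argument
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


for chosen, rejected in [(2.0, 0.0), (0.0, 0.0), (0.0, 2.0)]:
    print(f"margin {chosen - rejected:+.1f} -> loss {preference_loss(chosen, rejected):.4f}")
```

When the reward model cannot tell the two responses apart (margin 0), the loss is log 2. It shrinks toward zero as the chosen response is scored increasingly higher, and grows roughly linearly when the ranking is inverted.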

PPO Training

def ppo_step(policy, ref_policy, reward_model, prompts, old_logprobs=None,
             kl_coef=0.1, clip_eps=0.2):
    """Single PPO update step (simplified sketch).
    
    The KL penalty is crucial: without it, the model would quickly learn to
    "hack" the reward model by generating degenerate text that scores highly
    but is nonsensical. The KL term acts as a leash, keeping the policy
    close to the well-behaved reference model.
    
    Assumes policy and ref_policy expose generate() and log_prob(), and that
    rewards broadcast against the log-probs. A full implementation would also
    estimate advantages with a value function instead of using raw rewards.
    """
    # Step 1: Generate responses with current policy
    responses = policy.generate(prompts)
    
    # Step 2: Score responses with the reward model
    rewards = reward_model(prompts, responses)
    
    # Step 3: Compute KL divergence from reference model (the "leash")
    with torch.no_grad():
        ref_logprobs = ref_policy.log_prob(prompts, responses)
    policy_logprobs = policy.log_prob(prompts, responses)
    # Per-token KL approximation; detached so it acts as a reward signal only
    kl = (policy_logprobs - ref_logprobs).detach()
    
    # Total reward = RM reward minus KL penalty
    # Higher kl_coef = stronger pull toward reference model = more conservative
    total_reward = rewards.detach() - kl_coef * kl
    
    # old_logprobs are the log-probs recorded when the responses were sampled;
    # on the first update after sampling they equal the current policy's
    if old_logprobs is None:
        old_logprobs = policy_logprobs.detach()
    
    # PPO clipped objective: prevent catastrophically large policy updates
    ratio = torch.exp(policy_logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    loss = -torch.min(ratio * total_reward, clipped * total_reward).mean()
    
    return loss
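
The clipping step is easier to see on single numbers than on tensors. The sketch below evaluates the clipped surrogate for one sample in plain Python. `clipped_surrogate` is a name invented here, and `advantage` stands in for whatever reward or advantage signal the update uses.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO clipped objective for one sample (the quantity to maximize)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum makes the objective a pessimistic bound
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (advantage > 0): gains are capped once the ratio exceeds 1 + eps
print(clipped_surrogate(1.5, 1.0))
# Bad action (advantage < 0): raising its probability is penalized without a cap
print(clipped_surrogate(1.5, -1.0))
```

With a positive advantage, pushing the probability ratio past `1 + clip_eps` earns no extra objective, so the policy has no incentive to move far in one update. With a negative advantage, the unclipped term is the minimum, so moves in the wrong direction are still penalized in full.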

Using Foundation Models

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load pretrained LLM
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def generate(prompt, max_tokens=100):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
    )
    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

Model Comparison

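| Model | Size | Open weights | Strengths |
|---|---|---|---|
| GPT-4 | ~1T (unconfirmed) | No | Multimodal, reasoning |
| Claude 3 | Unknown | No | Safety, long context |
| LLaMA 3 | 8B-70B | Yes | Open, efficient |
| Mistral | 7B | Yes | Quality/size ratio |
| Gemma | 2B-7B | Yes | Small, efficient |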

Exercises

  • Plot loss vs compute for different model sizes. Verify the Chinchilla scaling law.
  • Train a small (10M parameter) causal language model on a text corpus.
  • Fine-tune a small LLM on instruction data using LoRA. Compare before/after.

What’s Next

Module 26: Capstone Project

Build a complete deep learning system from scratch.

Interview Deep-Dive

Strong Answer: The Chinchilla scaling law, from DeepMind’s 2022 paper, established that for a given compute budget, the optimal strategy is to scale model size and training data equally. The rule of thumb is roughly 20 tokens per parameter for compute-optimal training. This was a paradigm shift because prior to Chinchilla, the dominant approach (exemplified by GPT-3’s 175B parameters trained on only 300B tokens, a ratio of 1.7) was to build the biggest model you could afford and train it on whatever data you had.

Chinchilla demonstrated that their 70B model trained on 1.4T tokens (ratio of 20) outperformed the 280B Gopher model on most benchmarks, at roughly 4x lower inference cost. This directly spawned the “smaller but better-trained” movement: LLaMA (7B-65B trained on 1-1.4T tokens), Mistral, and Gemma all follow Chinchilla-optimal or even over-trained ratios.

Where it breaks down: the law assumes you care about training compute optimality. In practice, inference cost often dominates total cost of ownership. If you are serving a model to millions of users, a smaller model trained far beyond Chinchilla-optimal (say 7B parameters on 2T+ tokens, ratio of 285) is cheaper to serve even though you “wasted” training compute. LLaMA 3 trained 8B parameters on 15T tokens, a ratio of 1,875, precisely because inference savings at Meta’s scale vastly outweigh extra training cost. The scaling law also assumes fixed data quality; in practice, data curation and deduplication can shift the optimal ratio significantly.

Follow-up: If you were given a fixed GPU budget of 1000 H100-hours to train a model for a specific production task, how would you allocate between model size and training data?

I would first estimate the total compute in FLOPs: 1000 hours * 3600 s/hour * 990 TFLOPS (H100 dense BF16 peak) = about 3.6e21 FLOPs. That is an upper bound, since real training runs usually reach well below peak throughput. Using the common approximation C ≈ 6ND together with the Chinchilla ratio D ≈ 20N gives N ≈ sqrt(C / 120), which at peak throughput suggests roughly a 5.4B parameter model trained on about 110B tokens (and a smaller model at realistic utilization). But since this is for production deployment, I would skew toward a smaller model (say 1.8B parameters) trained on the same token budget, over-training by about 3x. The reasoning: at serving time, a 1.8B model is roughly 3x cheaper and faster, and the over-training penalty on final loss is modest, maybe 2-5% worse perplexity. I would also invest in high-quality domain-specific data rather than more generic web text, since data quality has been shown to shift the effective training ratio favorably.
Strong Answer: There are four major changes, each addressing a specific limitation discovered through years of scaling.

First, pre-normalization with RMSNorm instead of post-normalization with LayerNorm. The original Transformer applied LayerNorm after the residual addition (post-norm). This creates training instability at scale because the residual stream’s magnitude can grow unpredictably. Pre-norm applies normalization before the attention and MLP sublayers, which stabilizes the residual stream and makes training more robust. RMSNorm specifically drops the mean-centering step of LayerNorm, saving about 10% compute with empirically equivalent quality.

Second, Grouped Query Attention (GQA) instead of Multi-Head Attention. Standard MHA uses the same number of key-value heads as query heads. GQA uses fewer KV heads (typically 4-8x fewer). This dramatically reduces the KV cache memory during autoregressive inference. For a 70B model serving long sequences, the KV cache can consume 20+ GB; GQA cuts this proportionally. The quality impact is minimal because keys and values are shared across groups of query heads, and empirically the model learns to use this shared structure effectively.

Third, SwiGLU activation instead of ReLU or GELU in the feedforward network. SwiGLU uses a gating mechanism where one linear projection acts as a learned gate that modulates another projection. This consistently beats ReLU and GELU by about 1% on downstream benchmarks. The hidden dimension is adjusted by a factor of 2/3 to compensate for SwiGLU having three weight matrices instead of two, keeping the parameter count constant.

Fourth, Rotary Position Embeddings (RoPE) instead of absolute sinusoidal or learned position embeddings. RoPE encodes position information by rotating query and key vectors in 2D subspaces. This has a critical advantage: the attention score between two tokens depends only on their relative distance, not their absolute positions. This makes context extension much more practical: a model trained on 4K context can usually be adapted to 8K or beyond with rescaled rotation frequencies and a small amount of fine-tuning, which is far harder with absolute embeddings.

Follow-up: If you had to drop one of these improvements to simplify your implementation, which would you sacrifice and why?

I would drop SwiGLU and revert to GELU. The accuracy gain from SwiGLU (roughly 1%) is the smallest of the four changes, while the implementation complexity is non-trivial (three weight matrices instead of two, adjusted hidden dimension). Pre-norm and RoPE are essential for training stability and length generalization respectively, and GQA is essential for practical inference at scale. GELU with a standard two-matrix FFN is simpler, well-understood, and the 1% quality gap can often be closed with slightly more training data.
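
To put numbers on the KV cache savings from Grouped Query Attention mentioned in this answer, the sketch below computes cache size from first principles. The configuration (80 layers, 64 query heads, head dimension 128, 2-byte values, 8K context) is a hypothetical 70B-class setup chosen for illustration, and `kv_cache_bytes` is a name invented here.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer, KV head, and position."""
    keys_and_values = 2
    return (keys_and_values * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Multi-head attention: one KV head per query head (64)
mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=8192)
# Grouped query attention: 8 KV heads shared across the 64 query heads
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=8192)

print(f"MHA: {mha / 2**30:.1f} GiB per sequence, GQA: {gqa / 2**30:.1f} GiB per sequence")
```

The cache scales linearly with the number of KV heads, so using 8x fewer of them cuts per-sequence memory by 8x. That saving is multiplied again by the batch size when serving many requests at once.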
Strong Answer: RLHF has three phases. Phase one: supervised fine-tuning (SFT) on high-quality demonstration data teaches the model basic instruction-following behavior. Phase two: a reward model is trained on human preference data. Annotators see pairs of model outputs and indicate which they prefer, and the reward model learns to predict these preferences using a Bradley-Terry loss. Phase three: the policy model (the actual LLM) is optimized via PPO to maximize the reward model’s score while staying close to the SFT model via a KL divergence penalty.

The reward model is the weakest link, and its failure modes are well-documented.

First, reward hacking: the policy discovers patterns that score highly with the reward model but are not genuinely good responses. A classic example is the model learning to produce longer, more verbose answers because the reward model was trained on data where humans tended to prefer longer responses. The model exploits this spurious correlation rather than actually being more helpful. The KL penalty mitigates this but does not eliminate it.

Second, distribution shift: the reward model was trained on outputs from the SFT model, but during PPO training, the policy drifts into regions of output space the reward model has never seen. The reward model’s predictions in these out-of-distribution regions are unreliable, which can amplify reward hacking.

Third, annotation inconsistency: different human annotators disagree on preferences, sometimes substantially. The reward model learns an average of these inconsistent signals, which can produce a reward landscape that does not match any individual human’s preferences well. This is particularly problematic for subjective or culturally dependent preferences.

In practice, DPO (Direct Preference Optimization) has emerged as a simpler alternative that skips the separate reward model and directly optimizes the policy on preference data. It removes the explicit reward model as something the policy can exploit, but it has its own limitations: it can still overfit to quirks of the preference data, and it cannot easily incorporate real-time reward signals or non-pairwise preference data.

Follow-up: You notice that after RLHF, your model has become sycophantic: it agrees with the user even when the user is factually wrong. How do you diagnose and fix this?

Sycophancy is one of the most common RLHF failure modes. The diagnosis starts with targeted evaluation: create a benchmark of prompts where the user states an incorrect fact and asks the model to agree. Measure how often the post-RLHF model agrees versus the pre-RLHF (SFT) model. If the agreement rate increased, RLHF is the likely cause.

The root cause is usually in the preference data: annotators rated agreeable responses higher than disagreeable ones, even when disagreement was factually correct. The fix involves multiple interventions: (1) curate “constitutional” preference pairs where the correct ranking explicitly rewards polite disagreement over sycophantic agreement, (2) add a factuality reward signal (using a separate fact-checking model) alongside the preference-based reward, (3) increase the KL penalty to keep the model closer to the SFT baseline where sycophancy was less pronounced. Anthropic’s Constitutional AI approach addresses this by having the model critique its own outputs against explicit principles before the reward model is trained.
Strong Answer: Emergent capabilities are abilities that appear in larger models but are absent in smaller ones, not as a gradual improvement but as a sharp transition. The canonical examples include in-context few-shot learning (appearing around 10B parameters), chain-of-thought reasoning (around 100B), and multi-step tool use (around 500B+). The term draws an analogy to phase transitions in physics, like water freezing at a critical temperature.

The debate centers on a 2023 paper by Schaeffer et al. that argued emergence is largely an artifact of the metrics used to measure it. Their key insight: when you measure performance using exact-match accuracy (binary: the answer is either exactly right or completely wrong), you see a sharp jump from 0% to near-100% as scale increases. But when you measure using continuous metrics like token-level log-likelihood, the improvement is smooth and predictable across all scales. The “emergence” is in the metric, not the model.

Their argument is compelling for many benchmarks: if a model’s probability of generating the correct answer gradually increases from 1% to 30% to 70%, exact-match accuracy stays at 0% until the model consistently gets every token right, then jumps to near-100%. The underlying capability was improving all along, but the discontinuous metric hid the gradual progress.

However, there are genuine counterexamples where the capability itself seems qualitatively new, not just quantitatively better. Multi-step arithmetic, compositional generalization to novel combinations, and self-correction behaviors do appear to require a minimum scale threshold even under continuous metrics. The honest answer is that “emergence” is partly real and partly measurement artifact, and the field has not fully disentangled the two.

For practitioners, the practical takeaway is: do not assume your model will gain a specific capability by scaling up. Test explicitly at your target scale, and use continuous evaluation metrics whenever possible to get a more accurate picture of where your model actually is on the capability curve.

Follow-up: How would you design an evaluation to determine whether a specific capability in your model is truly emergent versus gradually improving?

I would test the model at multiple scales (1B, 3B, 7B, 13B, 70B) using both discrete and continuous metrics for the same underlying task. For example, if testing arithmetic: exact-match accuracy on the full equation (discrete), character-level edit distance from the correct answer (continuous), and per-digit accuracy (semi-continuous). If the discrete metric shows a sharp jump but the continuous metrics show smooth improvement, the “emergence” is a measurement artifact. If both metrics show a sharp phase transition at the same scale, the emergence is more likely real. I would also control for confounds like training data composition and tokenization differences across model sizes, since these can create false emergence signals.