Foundation models are large models trained on broad data that can be adapted to many downstream tasks. The analogy is a well-rounded liberal arts education: rather than training a specialist from scratch for every task, you invest heavily in a broad base of knowledge, then specialize with minimal additional effort. GPT-3 was trained once at enormous cost, but it can be adapted to summarization, translation, code generation, and thousands of other tasks with just a few examples, or even zero examples.

The term “foundation model” was coined by Stanford’s HAI in 2021, and it captures a paradigm shift: instead of building task-specific models from scratch, you build on top of a pre-existing “foundation” of learned knowledge.

Key characteristics:
Scale — billions of parameters, trained on trillions of tokens
Self-supervised pretraining — learns from raw text without human annotation
Emergent capabilities — abilities that appear only at sufficient scale, not explicitly trained
Transfer to diverse tasks — one model, many applications via prompting or fine-tuning
DeepMind’s Chinchilla paper (2022) fundamentally changed how the industry thinks about training LLMs. The key finding: most models before Chinchilla were dramatically undertrained; they were too large for the amount of data they saw.

For compute-optimal training:

$$N_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \propto C^{0.5}$$

Where:
N = number of parameters
D = dataset size (tokens)
C = compute budget (FLOPs)
Rule of thumb: Train on ~20 tokens per parameter. This means a 7B parameter model should see roughly 140B tokens for compute-optimal training.
Why this matters in practice: Before Chinchilla, the trend was “bigger model = better.” Chinchilla showed that a 70B model trained on 1.4T tokens outperformed the 280B Gopher model trained on 300B tokens — at much lower inference cost. This shifted the industry toward smaller, better-trained models, and directly influenced LLaMA, Mistral, and other efficient model families.
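To make the 20-tokens-per-parameter rule concrete, here is a minimal sketch of the sizing arithmetic. It assumes the common approximation C ≈ 6·N·D for training FLOPs, which is not stated above; the function name and constants are illustrative.

```python
import math

TOKENS_PER_PARAM = 20        # rule of thumb from the text
FLOPS_PER_PARAM_TOKEN = 6    # assumption: C ~= 6 * N * D (forward + backward pass)


def compute_optimal_size(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Return (params, tokens) that spend `compute_flops` at the given ratio.

    With D = r * N and C = 6 * N * D, solving for N gives N = sqrt(C / (6 * r)).
    """
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


# Sanity check against Chinchilla itself: 70B parameters on 1.4T tokens
chinchilla_flops = FLOPS_PER_PARAM_TOKEN * 70e9 * 1.4e12
params, tokens = compute_optimal_size(chinchilla_flops)
print(f"{params / 1e9:.0f}B parameters, {tokens / 1e12:.1f}T tokens")
```

Feeding Chinchilla's own budget back in should recover its configuration, which is a quick way to check that the ratio and the FLOPs approximation are being applied consistently.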
Modern LLMs use several key improvements over the original 2017 Transformer. Each one addresses a specific limitation discovered through years of scaling experiments. Understanding these is essential for reading current research papers and building production systems.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModernTransformerBlock(nn.Module):
    """LLaMA-style transformer block with modern improvements.

    Three key changes from the original Transformer:
    1. Pre-normalization (RMSNorm before attention/MLP, not after)
    2. Grouped Query Attention (fewer KV heads = faster inference)
    3. SwiGLU activation (empirically better than ReLU/GELU)
    """

    def __init__(self, dim, num_heads, mlp_ratio=4, dropout=0.0):
        super().__init__()
        # Pre-normalization with RMSNorm (simpler and faster than LayerNorm)
        self.norm1 = RMSNorm(dim)
        self.norm2 = RMSNorm(dim)

        # Grouped Query Attention: use 4x fewer KV heads than query heads
        # This dramatically reduces KV cache memory during inference
        # NOTE: GroupedQueryAttention is not defined in this snippet and must be
        # provided separately; `dropout` is accepted but unused here.
        self.attn = GroupedQueryAttention(dim, num_heads, num_kv_heads=num_heads // 4)

        # SwiGLU MLP: ~1% better than GELU across most benchmarks
        # The 2/3 factor compensates for SwiGLU having 3 weight matrices instead of 2
        self.mlp = SwiGLU(dim, int(dim * mlp_ratio * 2 / 3))

    def forward(self, x, freqs_cis=None):
        # Pre-norm + residual
        x = x + self.attn(self.norm1(x), freqs_cis)
        x = x + self.mlp(self.norm2(x))
        return x


class RMSNorm(nn.Module):
    """Root Mean Square Normalization.

    Simpler than LayerNorm: skips the mean-centering step.
    Empirically equivalent in quality but ~10% faster because it avoids
    computing the mean across the feature dimension.
    Used in LLaMA, Mistral, and most modern LLMs.
    """

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Compute root mean square (no mean subtraction, unlike LayerNorm)
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight


class SwiGLU(nn.Module):
    """SwiGLU activation (better than ReLU/GELU for LLMs).

    SwiGLU uses a gating mechanism: one linear projection creates a "gate"
    that controls how much information flows through another projection.
    Think of it as the network learning to selectively amplify or suppress
    different features at each position.

    Formula: SwiGLU(x) = (Swish(xW1)) * (xW3), then projected by W2
    """

    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # Gate projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # Down projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # Up projection

    def forward(self, x):
        # F.silu(x) = x * sigmoid(x), also called "Swish"
        # The gate (silu(w1(x))) controls the flow of the value (w3(x))
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```
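The block above instantiates a `GroupedQueryAttention` module that is never defined. The following is one possible minimal sketch with the same constructor and call signature, not the LLaMA implementation. It assumes PyTorch 2.0+ for `F.scaled_dot_product_attention`, always applies a causal mask, and accepts `freqs_cis` only for interface compatibility (rotary embeddings are not applied here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedQueryAttention(nn.Module):
    """Causal self-attention where groups of query heads share one KV head."""

    def __init__(self, dim, num_heads, num_kv_heads):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        assert num_heads % num_kv_heads == 0, "num_heads must be a multiple of num_kv_heads"
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.head_dim = dim // num_heads
        # Queries keep all heads; keys and values use the smaller KV head count
        self.wq = nn.Linear(dim, num_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, num_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, num_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(num_heads * self.head_dim, dim, bias=False)

    def forward(self, x, freqs_cis=None):
        # freqs_cis is ignored in this sketch (no rotary embedding applied)
        bsz, seq_len, _ = x.shape
        q = self.wq(x).view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(bsz, seq_len, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(bsz, seq_len, self.num_kv_heads, self.head_dim).transpose(1, 2)

        # Repeat each KV head so it lines up with its group of query heads
        group_size = self.num_heads // self.num_kv_heads
        k = k.repeat_interleave(group_size, dim=1)
        v = v.repeat_interleave(group_size, dim=1)

        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(bsz, seq_len, -1)
        return self.wo(out)
```

Repeating the KV heads keeps the attention call identical to standard multi-head attention; the memory saving comes from caching only the `num_kv_heads` keys and values before they are repeated.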
```python
import torch.nn.functional as F


def causal_lm_loss(logits, labels):
    """Next token prediction loss.

    The fundamental training objective: at every position, predict the NEXT token.
    If the sequence is [The, cat, sat], we want:
    - Position 0 (The) -> predict "cat"
    - Position 1 (cat) -> predict "sat"

    This is why we shift logits and labels by one position.
    """
    # Shift: logits at position t predict the token at position t+1
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()

    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),  # Flatten to (batch*(seq-1), vocab)
        shift_labels.view(-1),                         # Flatten to (batch*(seq-1),)
        ignore_index=-100,  # Skip positions labeled -100 (e.g. padding)
    )
    return loss
```
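To see the shift without any tensor machinery, here is a small pure-Python sketch of the same computation for a single sequence. The function name and the list-of-lists input format are illustrative assumptions, not part of the code above.

```python
import math

IGNORE_INDEX = -100  # same sentinel as in the loss above


def next_token_loss(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy where logits[t] is scored against labels[t + 1].

    logits: one list of vocabulary scores per position
    labels: one token id per position (ignore_index marks skipped targets)
    """
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]          # position t predicts the NEXT token
        if target == ignore_index:
            continue                    # e.g. padding: contributes nothing
        scores = logits[t]
        log_normalizer = math.log(sum(math.exp(s) for s in scores))
        total += log_normalizer - scores[target]   # -log softmax(scores)[target]
        count += 1
    return total / count if count else float("nan")


# Uniform scores over a 4-token vocabulary give a loss of log(4)
uniform = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
print(next_token_loss(uniform, [1, 2, 3]))
```

Note that the last position's logits are never scored, because there is no following token to predict.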
One of the most fascinating (and debated) phenomena in foundation models: as models scale, new abilities appear that were not present at smaller scales. This is not a gradual improvement — it is more like phase transitions in physics, where water suddenly becomes ice at a critical temperature.
| Scale | Emergent Capability |
| --- | --- |
| ~1B | Basic language understanding |
| ~10B | Few-shot learning |
| ~100B | Complex reasoning, code generation |
| ~500B+ | Multi-step reasoning, tool use |
Important nuance: The concept of “emergence” in LLMs is actively debated. Some researchers argue that emergence is partly an artifact of how we measure performance (e.g., using exact-match accuracy that jumps from 0% to near-100%). When using continuous metrics like log-likelihood, the improvement often looks gradual. Regardless, the practical reality is clear: larger models can do things smaller models cannot, and predicting exactly which capabilities will appear at which scale remains an open problem.
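The measurement argument can be illustrated with a few lines of arithmetic. This sketch assumes each token of an answer is correct independently with the same probability, which is a simplification chosen only to show the effect; the function name is illustrative.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Chance that every one of `answer_length` tokens is right.

    Assumes token errors are independent, which real models do not guarantee.
    """
    return per_token_accuracy ** answer_length


# Per-token accuracy improves smoothly, but exact match on a 10-token answer
# stays near zero for a long time and then rises steeply.
for p in (0.5, 0.7, 0.9, 0.99):
    print(f"per-token {p:.2f} -> exact match {exact_match_rate(p, 10):.3f}")
```

Under this toy model, a steady climb in per-token accuracy looks like a sudden jump when only whole-answer correctness is reported.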
RLHF is how raw language models become useful assistants. The base model can generate fluent text, but it might also be toxic, unhelpful, or dishonest. RLHF aligns the model with human preferences through a three-step process: (1) supervised fine-tuning on demonstrations, (2) training a reward model on human preference rankings, and (3) using RL (typically PPO) to optimize the policy against the reward model while staying close to the original model.

The analogy: the base model is a talented but unsocialized intern. SFT teaches them basic professional behavior from examples. RLHF then refines their judgment by showing them pairs of responses and learning which ones humans prefer.
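```python
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Reward model trained on human preferences."""

    def __init__(self, base_model):
        super().__init__()
        self.base = base_model
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.base(input_ids, attention_mask=attention_mask)
        # Use the hidden state at the final position as the sequence summary.
        # This assumes left-padded (or unpadded) batches; with right padding,
        # index the last non-pad token via attention_mask instead.
        last_hidden = outputs.last_hidden_state[:, -1, :]  # Last token
        return self.reward_head(last_hidden)


def compute_preference_loss(reward_model, chosen, rejected):
    """Bradley-Terry preference model loss."""
    # chosen / rejected: dicts holding input_ids and attention_mask tensors
    reward_chosen = reward_model(**chosen)
    reward_rejected = reward_model(**rejected)
    # Maximize the log-probability that the chosen response outranks the rejected one
    loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
    return loss
```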
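The Bradley-Terry loss is easier to reason about on scalar rewards. The sketch below works through the same formula for a single preference pair; the function names are illustrative and the reward values are made up for the example.

```python
import math


def preference_probability(reward_chosen, reward_rejected):
    """Bradley-Terry probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def bradley_terry_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of the human preference for one pair."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))


# Equal rewards: the model is undecided, loss = log(2)
print(bradley_terry_loss(0.0, 0.0))
# Chosen scored 2 points higher: low loss
print(bradley_terry_loss(1.5, -0.5))
# Ranking reversed: high loss
print(bradley_terry_loss(-0.5, 1.5))
```

Only the difference between the two rewards matters, so the reward model's absolute scale is not pinned down by this loss.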
```python
import torch


def ppo_step(policy, ref_policy, reward_model, prompts, kl_coef=0.1, eps=0.2):
    """Single PPO update step (simplified).

    The KL penalty is crucial: without it, the model would quickly learn to
    "hack" the reward model by generating degenerate text that scores highly
    but is nonsensical. The KL term acts as a leash, keeping the policy close
    to the well-behaved reference model.

    Simplifications: `generate` and `log_prob` are assumed helper methods
    (`log_prob` returning one summed log-probability per response), and the
    KL-penalized reward stands in for a proper advantage estimate.
    """
    # Step 1: Generate responses with current policy
    responses = policy.generate(prompts)

    with torch.no_grad():
        # Step 2: Score responses with the reward model
        rewards = reward_model(prompts, responses)

        # Step 3: Compute KL divergence from reference model (the "leash")
        old_logprobs = policy.log_prob(prompts, responses)  # Rollout-time snapshot
        ref_logprobs = ref_policy.log_prob(prompts, responses)
        kl = old_logprobs - ref_logprobs  # Sample-based KL estimate

        # Total reward = RM reward minus KL penalty
        # Higher kl_coef = stronger pull toward reference model = more conservative
        total_reward = rewards - kl_coef * kl

    # Log-probs with gradients; on the first update after a rollout the ratio is 1
    policy_logprobs = policy.log_prob(prompts, responses)

    # PPO clipped objective: prevent catastrophically large policy updates
    ratio = torch.exp(policy_logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    loss = -torch.min(ratio * total_reward, clipped * total_reward).mean()
    return loss
```
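The two ingredients of the update, the KL-penalized reward and the clipped surrogate, can be checked on plain numbers. This is a scalar sketch with illustrative function names and made-up inputs; a real implementation works per token and uses an advantage estimator.

```python
import math


def kl_penalized_reward(rm_reward, policy_logprob, ref_logprob, kl_coef=0.1):
    """Reward-model score minus a penalty for drifting from the reference."""
    return rm_reward - kl_coef * (policy_logprob - ref_logprob)


def clipped_surrogate(new_logprob, old_logprob, advantage, eps=0.2):
    """PPO clipped objective for one sample (to be maximized)."""
    ratio = math.exp(new_logprob - old_logprob)
    clipped_ratio = max(1 - eps, min(ratio, 1 + eps))
    # Taking the minimum makes the objective pessimistic: large ratio changes
    # are never rewarded beyond the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


# Policy became much more likely to produce a good response: gain is capped at 1 + eps
print(clipped_surrogate(new_logprob=-1.0, old_logprob=-1.5, advantage=1.0))
# Policy became more likely to produce a bad response: the full penalty applies
print(clipped_surrogate(new_logprob=-1.0, old_logprob=-1.5, advantage=-1.0))
```

The asymmetry is the point of the clipping: improvements are capped, while moves in the wrong direction are penalized in full.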
Explain the Chinchilla scaling law and how it changed the industry's approach to training LLMs. Where does the law break down?
Strong Answer:

The Chinchilla scaling law, from DeepMind’s 2022 paper, established that for a given compute budget, the optimal strategy is to scale model size and training data equally. The rule of thumb is roughly 20 tokens per parameter for compute-optimal training. This was a paradigm shift because prior to Chinchilla, the dominant approach (exemplified by GPT-3’s 175B parameters trained on only 300B tokens, a ratio of 1.7) was to build the biggest model you could afford and train it on whatever data you had.

Chinchilla demonstrated that their 70B model trained on 1.4T tokens (ratio of 20) outperformed the 280B Gopher model on most benchmarks, at roughly 4x lower inference cost. This directly spawned the “smaller but better-trained” movement: LLaMA (7B-65B trained on 1-1.4T tokens), Mistral, and Gemma all follow Chinchilla-optimal or even over-trained ratios.

Where it breaks down: the law assumes you care about training compute optimality. In practice, inference cost often dominates total cost of ownership. If you are serving a model to millions of users, a smaller model trained far beyond Chinchilla-optimal (say 7B parameters on 2T+ tokens, ratio of 285) is cheaper to serve even though you “wasted” training compute. LLaMA 3 trained 8B parameters on 15T tokens, a ratio of 1,875, precisely because inference savings at Meta’s scale vastly outweigh extra training cost. The scaling law also assumes fixed data quality; in practice, data curation and deduplication can shift the optimal ratio significantly.

Follow-up: If you were given a fixed GPU budget of 1000 H100-hours to train a model for a specific production task, how would you allocate between model size and training data?

I would first estimate the total compute in FLOPs: roughly 1000 hours * 3600 seconds * 990 TFLOPS (H100 BF16 peak) = about 3.6e21 FLOPs. This is an upper bound, since real training runs reach only a fraction of peak throughput. Using the approximation C ≈ 6ND with D = 20N, that suggests roughly a 5.5B parameter model trained on about 110B tokens for compute-optimal training.

But since this is for production deployment, I would skew toward a smaller model (say 1.8B parameters) trained on the same token budget, about 60 tokens per parameter or 3x the Chinchilla ratio. The reasoning: at serving time, a 1.8B model is roughly 3x cheaper and faster, and the over-training penalty on final loss is typically modest. I would also invest in high-quality domain-specific data rather than more generic web text, since data quality has been shown to shift the effective training ratio favorably.
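The claim that inference cost can outweigh "wasted" training compute can be made quantitative with a rough FLOPs model. The sketch below assumes about 6·N·D FLOPs for training and about 2·N FLOPs per generated token at inference; both are standard approximations rather than figures from the text. The 13B comparison model is a hypothetical Chinchilla-style configuration, and model quality is not compared at all.

```python
def training_flops(params, train_tokens):
    """Approximate training cost: 6 FLOPs per parameter per token (assumption)."""
    return 6 * params * train_tokens


def lifetime_flops(params, train_tokens, served_tokens):
    """Training cost plus inference at ~2 FLOPs per parameter per token (assumption)."""
    return training_flops(params, train_tokens) + 2 * params * served_tokens


def break_even_served_tokens(small, large):
    """Served tokens after which the over-trained small model is cheaper overall.

    `small` and `large` are (params, train_tokens) tuples. A non-positive
    result means the small model is already cheaper before serving anything.
    """
    extra_training = training_flops(*small) - training_flops(*large)
    savings_per_token = 2 * (large[0] - small[0])
    return extra_training / savings_per_token


over_trained_7b = (7e9, 2e12)       # 7B on 2T tokens, the example from the text
chinchilla_13b = (13e9, 260e9)      # hypothetical 13B at 20 tokens per parameter
tokens = break_even_served_tokens(over_trained_7b, chinchilla_13b)
print(f"break-even after about {tokens / 1e12:.1f}T served tokens")
```

A high-traffic service can pass a break-even point like this quickly, which is the economic argument for training small models well past the Chinchilla ratio.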
Walk me through the key architectural differences between the original 2017 Transformer and a modern LLM like LLaMA. Why was each change made?
Strong Answer:

There are four major changes, each addressing a specific limitation discovered through years of scaling.

First, pre-normalization with RMSNorm instead of post-normalization with LayerNorm. The original Transformer applied LayerNorm after the residual addition (post-norm). This tends to destabilize training at scale, because every residual path passes through a normalization and gradients become sensitive to warmup and initialization. Pre-norm applies normalization before the attention and MLP sublayers, leaving an identity path through the residual stream, which makes training more robust. RMSNorm specifically drops the mean-centering step of LayerNorm, making the normalization roughly 10% faster with empirically equivalent quality.

Second, Grouped Query Attention (GQA) instead of Multi-Head Attention. Standard MHA uses the same number of key-value heads as query heads. GQA uses fewer KV heads (typically 4-8x fewer). This dramatically reduces the KV cache memory during autoregressive inference. For a 70B model serving long sequences, the KV cache can consume 20+ GB; GQA cuts this in proportion to the reduction in KV heads. The quality impact is minimal because keys and values are shared across groups of query heads, and empirically the model learns to use this shared structure effectively.

Third, SwiGLU activation instead of ReLU or GELU in the feedforward network. SwiGLU uses a gating mechanism where one linear projection acts as a learned gate that modulates another projection. It is reported to beat ReLU and GELU by about 1% on downstream benchmarks. The hidden dimension is adjusted by a factor of 2/3 to compensate for SwiGLU having three weight matrices instead of two, keeping the parameter count roughly constant.

Fourth, Rotary Position Embeddings (RoPE) instead of absolute sinusoidal or learned position embeddings. RoPE encodes position information by rotating query and key vectors in 2D subspaces. This has a critical advantage: the positional contribution to the attention score between two tokens depends only on their relative distance, not their absolute positions. This makes context extension more tractable: with techniques such as position interpolation, a model trained on 4K context can often be extended to 8K or beyond with modest fine-tuning, which is much harder with absolute embeddings.

Follow-up: If you had to drop one of these improvements to simplify your implementation, which would you sacrifice and why?

I would drop SwiGLU and revert to GELU. The accuracy gain from SwiGLU (roughly 1%) is the smallest of the four changes, while the implementation complexity is non-trivial (three weight matrices instead of two, adjusted hidden dimension). Pre-norm and RoPE are essential for training stability and length generalization respectively, and GQA is essential for practical inference at scale. GELU with a standard two-matrix FFN is simpler, well-understood, and the 1% quality gap can often be closed with slightly more training data.
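The KV cache saving from GQA follows directly from counting cached values. The sketch below uses an assumed configuration (80 layers, 64 query heads, head dimension 128, 8 KV heads, 2 bytes per value), roughly the shape of a 70B-class model but chosen here for illustration only.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Memory for cached keys and values across all layers.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


GIB = 2 ** 30
LAYERS, QUERY_HEADS, HEAD_DIM = 80, 64, 128   # assumed 70B-class shape

mha = kv_cache_bytes(LAYERS, num_kv_heads=QUERY_HEADS, head_dim=HEAD_DIM, seq_len=4096)
gqa = kv_cache_bytes(LAYERS, num_kv_heads=8, head_dim=HEAD_DIM, seq_len=4096)
print(f"MHA: {mha / GIB:.2f} GiB per sequence, GQA: {gqa / GIB:.2f} GiB per sequence")
```

The cache grows linearly with both sequence length and batch size, so an 8x reduction in KV heads translates directly into 8x more concurrent sequences or 8x longer contexts for the same memory.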
How does RLHF work at a high level, and what are the known failure modes of the reward model?
Strong Answer:

RLHF has three phases. Phase one: supervised fine-tuning (SFT) on high-quality demonstration data teaches the model basic instruction-following behavior. Phase two: a reward model is trained on human preference data. Annotators see pairs of model outputs and indicate which they prefer, and the reward model learns to predict these preferences using a Bradley-Terry loss. Phase three: the policy model (the actual LLM) is optimized via PPO to maximize the reward model’s score while staying close to the SFT model via a KL divergence penalty.

The reward model is the weakest link, and its failure modes are well-documented.

First, reward hacking: the policy discovers patterns that score highly with the reward model but are not genuinely good responses. A classic example is the model learning to produce longer, more verbose answers because the reward model was trained on data where humans tended to prefer longer responses. The model exploits this spurious correlation rather than actually being more helpful. The KL penalty mitigates this but does not eliminate it.

Second, distribution shift: the reward model was trained on outputs from the SFT model, but during PPO training, the policy drifts into regions of output space the reward model has never seen. The reward model’s predictions in these out-of-distribution regions are unreliable, which can amplify reward hacking.

Third, annotation inconsistency: different human annotators disagree on preferences, sometimes substantially. The reward model learns an average of these inconsistent signals, which can produce a reward landscape that does not match any individual human’s preferences well. This is particularly problematic for subjective or culturally dependent preferences.

In practice, DPO (Direct Preference Optimization) has emerged as a simpler alternative that skips the separately trained reward model and directly optimizes the policy on preference data. It removes one source of reward hacking, since there is no learned reward model to exploit, but it can still overfit to spurious patterns in the preference data, and it cannot easily incorporate real-time reward signals or non-pairwise preference data.

Follow-up: You notice that after RLHF, your model has become sycophantic: it agrees with the user even when the user is factually wrong. How do you diagnose and fix this?

Sycophancy is one of the most common RLHF failure modes. The diagnosis starts with targeted evaluation: create a benchmark of prompts where the user states an incorrect fact and asks the model to agree. Measure how often the post-RLHF model agrees versus the pre-RLHF (SFT) model. If the agreement rate increased, RLHF is the likely cause.

The root cause is usually in the preference data: annotators rated agreeable responses higher than disagreeable ones, even when disagreement was factually correct. The fix involves multiple interventions: (1) curate “constitutional” preference pairs where the correct ranking explicitly rewards polite disagreement over sycophantic agreement, (2) add a factuality reward signal (using a separate fact-checking model) alongside the preference-based reward, (3) increase the KL penalty to keep the model closer to the SFT baseline where sycophancy was less pronounced. Anthropic’s Constitutional AI approach addresses this by having the model critique its own outputs against explicit principles before the reward model is trained.
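For reference, the DPO objective mentioned in the answer above can be written on scalar sequence log-probabilities. This is a sketch of the published DPO loss for a single preference pair; the function name, argument names, and the default `beta` are illustrative choices rather than anything specified in the text.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair, given summed log-probs of each full response.

    The implicit reward of a response is beta * (policy log-prob - reference
    log-prob); the loss is the Bradley-Terry loss on those implicit rewards.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    return math.log1p(math.exp(-logits))   # equals -log(sigmoid(logits))


# Policy identical to the reference: no preference learned yet, loss = log(2)
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the chosen response: lower loss
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The reference log-probabilities play the role that the KL penalty plays in PPO: only movement relative to the reference model is rewarded.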
What are emergent capabilities in foundation models, and why is there active debate about whether 'emergence' is real or an artifact of measurement?
Strong Answer:

Emergent capabilities are abilities that appear in larger models but are absent in smaller ones, not as a gradual improvement but as a sharp transition. Commonly cited examples include in-context few-shot learning (appearing around 10B parameters), chain-of-thought reasoning (around 100B), and multi-step tool use (around 500B+). The term draws an analogy to phase transitions in physics, like water freezing at a critical temperature.

The debate centers on a 2023 paper by Schaeffer et al. that argued emergence is largely an artifact of the metrics used to measure it. Their key insight: when you measure performance using exact-match accuracy (binary: the answer is either exactly right or completely wrong), you see a sharp jump from 0% to near-100% as scale increases. But when you measure using continuous metrics like token-level log-likelihood, the improvement is smooth and predictable across all scales. The “emergence” is in the metric, not the model.

Their argument is compelling for many benchmarks: if a model’s probability of generating each correct token gradually increases from 1% to 30% to 70%, exact-match accuracy on a multi-token answer stays near 0% until the model gets almost every token right, then climbs steeply toward 100%. The underlying capability was improving all along, but the discontinuous metric hid the gradual progress.

However, some researchers point to cases where the capability itself seems qualitatively new, not just quantitatively better. Multi-step arithmetic, compositional generalization to novel combinations, and self-correction behaviors may require a minimum scale threshold even under continuous metrics. The honest answer is that “emergence” is partly real and partly measurement artifact, and the field has not fully disentangled the two.

For practitioners, the practical takeaway is: do not assume your model will gain a specific capability by scaling up. Test explicitly at your target scale, and use continuous evaluation metrics whenever possible to get a more accurate picture of where your model actually is on the capability curve.

Follow-up: How would you design an evaluation to determine whether a specific capability in your model is truly emergent versus gradually improving?

I would test the model at multiple scales (1B, 3B, 7B, 13B, 70B) using both discrete and continuous metrics for the same underlying task. For example, if testing arithmetic: exact-match accuracy on the full equation (discrete), character-level edit distance from the correct answer (continuous), and per-digit accuracy (semi-continuous). If the discrete metric shows a sharp jump but the continuous metrics show smooth improvement, the “emergence” is a measurement artifact. If both metrics show a sharp phase transition at the same scale, the emergence is more likely real. I would also control for confounds like training data composition and tokenization differences across model sizes, since these can create false emergence signals.
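The discrete and semi-continuous metrics from this evaluation design can be sketched in a few lines. The function names are illustrative, the inputs are toy strings rather than real model outputs, and per-digit accuracy is defined here by position-wise comparison, which is one of several reasonable choices.

```python
def exact_match(prediction, target):
    """Discrete metric: 1.0 only if the whole answer is right."""
    return float(prediction == target)


def per_digit_accuracy(prediction, target):
    """Semi-continuous metric: fraction of positions that match.

    Dividing by the longer length penalizes both missing and extra digits.
    """
    longest = max(len(prediction), len(target))
    if longest == 0:
        return 1.0
    hits = sum(1 for p, t in zip(prediction, target) if p == t)
    return hits / longest


def mean_score(metric, predictions, targets):
    """Average a metric over paired predictions and targets."""
    return sum(metric(p, t) for p, t in zip(predictions, targets)) / len(targets)


# Toy example: one fully correct answer, one with a single wrong digit
predictions = ["1234", "1239"]
targets = ["1234", "1234"]
print(mean_score(exact_match, predictions, targets))
print(mean_score(per_digit_accuracy, predictions, targets))
```

Running both metrics on the same predictions at every model size is what lets you separate a jump in the metric from a jump in the capability.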