Supervised learning requires a “right answer” for every training example. But what if the right answer is subjective (“Is this response helpful?”), non-differentiable (“Does this code compile?”), or only available after a long sequence of actions (“Did we win the chess game?”)? This is where reinforcement learning enters the deep learning toolkit.
RL provides a framework for:
Learning from human preferences (RLHF) — this is how ChatGPT, Claude, and other AI assistants are aligned with human values
Optimizing non-differentiable objectives — BLEU score, compilation success, user engagement
Aligning AI systems with human values — teaching models to be helpful, harmless, and honest
Training agents that interact with environments — robotics, game playing, autonomous systems
Why this chapter matters for DL practitioners: Even if you never build an RL agent from scratch, understanding RLHF and DPO is essential for anyone working with modern language models. These techniques are the “secret sauce” that transforms a base model that merely predicts text into an assistant that follows instructions and behaves safely.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import numpy as np
    from typing import List, Tuple, Optional, Dict
    from dataclasses import dataclass

    torch.manual_seed(42)
REINFORCE is the simplest policy gradient method, and understanding it is essential before tackling PPO or RLHF. The core idea: sample actions from the policy, observe the reward, and increase the probability of actions that led to high reward while decreasing those that led to low reward. Think of it like training a dog — you cannot tell it what to do in advance, but you can reward good behavior after the fact and the dog gradually learns to repeat those behaviors.
REINFORCE’s fatal flaw is variance: because actions are sampled from a stochastic policy, and rewards often come from a stochastic environment, the gradient estimates are extremely noisy. A single lucky trajectory can dominate the gradient, causing the policy to overfit to one good experience. This is why REINFORCE needs a baseline (typically the value function V(s)): subtracting it from the return reduces variance without changing the expected gradient. Without a baseline, REINFORCE is often impractical for anything beyond toy problems. PPO was designed largely to address these stability issues.
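The effect of a baseline can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the chapter's code: it uses a hypothetical two-armed bandit with a softmax policy, Gaussian reward noise, and a constant baseline of 10.5 (the average reward under the uniform policy) standing in for V(s).

```python
import torch
import torch.nn.functional as F

def reinforce_grad_samples(logits, arm_means, baseline, n_samples, seed=0):
    """Per-sample REINFORCE gradient estimates with respect to the logits."""
    gen = torch.Generator().manual_seed(seed)
    probs = torch.softmax(logits, dim=0)
    # Sample actions from the policy and noisy rewards from the (toy) environment
    actions = torch.multinomial(probs, n_samples, replacement=True, generator=gen)
    rewards = arm_means[actions] + torch.randn(n_samples, generator=gen)
    # For a softmax policy, d log pi(a) / d logits = one_hot(a) - probs
    score = F.one_hot(actions, num_classes=len(logits)).float() - probs
    # Each row is one sample of grad log pi(a) * (reward - baseline)
    return score * (rewards - baseline).unsqueeze(1)

logits = torch.zeros(2)                  # uniform policy over two arms
arm_means = torch.tensor([10.0, 11.0])   # both arms pay well; arm 1 is slightly better

no_baseline = reinforce_grad_samples(logits, arm_means, baseline=0.0, n_samples=20000)
with_baseline = reinforce_grad_samples(logits, arm_means, baseline=10.5, n_samples=20000)

print("mean gradient, no baseline:  ", no_baseline.mean(dim=0))
print("mean gradient, with baseline:", with_baseline.mean(dim=0))
print("variance, no baseline:       ", no_baseline.var(dim=0))
print("variance, with baseline:     ", with_baseline.var(dim=0))
```

Both estimators target the same expected gradient (0.25 for the logit of arm 1 in this setup), but the per-sample variance is far smaller once the large common reward offset is subtracted.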
    class REINFORCE:
        """Basic policy gradient algorithm."""

        def __init__(
            self,
            policy: nn.Module,  # e.g. a PolicyNetwork instance
            lr: float = 1e-3,
            gamma: float = 0.99,
        ):
            self.policy = policy
            self.optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
            self.gamma = gamma

        def compute_returns(self, rewards: List[float]) -> torch.Tensor:
            """Compute normalized discounted returns."""
            returns = []
            R = 0.0
            # Walk backwards so each return reuses the one after it
            for r in reversed(rewards):
                R = r + self.gamma * R
                returns.insert(0, R)
            returns = torch.tensor(returns, dtype=torch.float32)
            # Normalizing reduces variance; skip it for one-step
            # episodes, where std() of a single value is NaN
            if returns.numel() > 1:
                returns = (returns - returns.mean()) / (returns.std() + 1e-8)
            return returns

        def update(self, trajectories: List[Dict]) -> float:
            """Update policy from collected trajectories."""
            total_loss = 0
            for trajectory in trajectories:
                log_probs = trajectory['log_probs']
                rewards = trajectory['rewards']
                returns = self.compute_returns(rewards)

                # Policy gradient loss: -sum_t log pi(a_t | s_t) * G_t
                policy_loss = 0
                for log_prob, R in zip(log_probs, returns):
                    policy_loss -= log_prob * R
                total_loss += policy_loss

            # Optimize the mean loss over trajectories
            mean_loss = total_loss / len(trajectories)
            self.optimizer.zero_grad()
            mean_loss.backward()
            self.optimizer.step()
            return mean_loss.item()
PPO is the workhorse of modern RL, used in RLHF and many robotics applications. Its key idea is in the name: “proximal” means it discourages the policy from changing too much in a single update. Without this constraint, RL training is notoriously unstable: one bad update can catastrophically degrade the policy. PPO clips the probability ratio between the new and old policies, so the objective offers no extra gain for pushing that ratio outside a small interval around 1. This approximates a “trust region” for safe optimization, although it is not a hard guarantee that the new policy stays close to the old one.
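The clipping mechanism fits in a few lines. The following is a minimal sketch of the clipped surrogate loss only (no value loss, entropy bonus, or advantage estimation); the function name and the `clip_eps=0.2` default are illustrative choices rather than anything fixed by this chapter.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate loss (to be minimized) for a batch of actions."""
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (smaller) objective, then negate to get a loss
    return -torch.min(unclipped, clipped).mean()

# Case 1: positive advantage, ratio already at 1.5 (> 1 + clip_eps).
# The clipped branch is active, so the gradient vanishes.
new_lp_far = torch.tensor([1.5]).log().requires_grad_()
loss_far = ppo_clip_loss(new_lp_far, torch.zeros(1), torch.ones(1))
loss_far.backward()

# Case 2: same advantage, ratio still at 1.0.
# The update is unconstrained and the gradient is non-zero.
new_lp_near = torch.zeros(1, requires_grad=True)
loss_near = ppo_clip_loss(new_lp_near, torch.zeros(1), torch.ones(1))
loss_near.backward()

print("ratio 1.5 -> loss", loss_far.item(), "grad", new_lp_far.grad)
print("ratio 1.0 -> loss", loss_near.item(), "grad", new_lp_near.grad)
```

Once the ratio has moved past the clip range in the direction the advantage favors, the sample stops contributing gradient, which is what keeps a single update from running away.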
RLHF is how models like ChatGPT are aligned with human preferences. The three-stage pipeline below might seem complex, but each stage addresses a fundamental limitation: SFT gives the model basic instruction-following ability, the reward model captures nuanced human preferences that are hard to express as rules, and PPO fine-tuning optimizes the model against those preferences while preventing it from drifting too far from its original capabilities.
Reward hacking is the biggest risk in RLHF: The policy will exploit any flaw in the reward model. If the reward model assigns high scores to verbose responses, the policy will learn to be verbose — even when brevity would be more helpful. If the reward model prefers confident-sounding text, the policy will hallucinate confidently. The KL penalty against the reference model is your primary defense: it prevents the policy from moving to regions of output space where the reward model is unreliable. Monitor reward model accuracy on held-out data throughout training. If the policy’s average reward score rises dramatically but actual quality (as judged by humans) plateaus, you are likely seeing reward hacking.
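One common way to apply the KL penalty is to fold it into the per-token reward that PPO sees. The sketch below assumes that formulation; the function name, the `kl_coef` value, and the choice to add the reward model score on the final token are illustrative assumptions, and padding is ignored for brevity.

```python
import torch

def kl_shaped_rewards(rm_score, policy_logps, ref_logps, kl_coef=0.1):
    """Combine a sequence-level reward with a per-token KL penalty.

    rm_score:     (batch,) reward model score for each full response
    policy_logps: (batch, seq_len) log-probs of the sampled tokens under the policy
    ref_logps:    (batch, seq_len) log-probs of the same tokens under the reference
    """
    # Sample-based estimate of the per-token KL between policy and reference
    per_token_kl = (policy_logps - ref_logps).detach()
    # Every token pays a penalty for drifting away from the reference
    rewards = -kl_coef * per_token_kl
    # The reward model score is credited once, at the end of the response
    rewards[:, -1] += rm_score
    return rewards

rm_score = torch.tensor([1.0])
policy_logps = torch.tensor([[-1.0, -1.0]])
ref_logps = torch.tensor([[-1.0, -2.0]])   # the policy favors token 2 more than the reference does
shaped = kl_shaped_rewards(rm_score, policy_logps, ref_logps)
print(shaped)
```

A response that earns a high reward model score only by moving far from the reference loses part of that score to the penalty, which is the defense described above.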
1. Supervised Fine-Tuning (SFT)
   └── Train on high-quality demonstrations
2. Reward Model Training
   └── Train a model to predict human preferences
3. RL Fine-Tuning (PPO)
   └── Optimize policy to maximize reward model
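Stage 2 is commonly trained with a pairwise (Bradley-Terry style) loss that pushes the reward of the preferred response above the rejected one. The text here does not specify the loss, so treat the sketch below as an assumption; the linear `reward_model` and the random feature vectors are hypothetical stand-ins for a language model with a scalar head and real response embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards, rejected_rewards):
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

torch.manual_seed(0)
# Toy stand-in data: "chosen" responses have slightly larger feature values
chosen_feats = torch.randn(64, 8) + 0.5
rejected_feats = torch.randn(64, 8) - 0.5

reward_model = nn.Linear(8, 1)   # hypothetical toy reward model
optimizer = torch.optim.Adam(reward_model.parameters(), lr=0.05)

losses = []
for _ in range(100):
    chosen_r = reward_model(chosen_feats).squeeze(-1)
    rejected_r = reward_model(rejected_feats).squeeze(-1)
    loss = pairwise_reward_loss(chosen_r, rejected_r)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"pairwise loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Only the difference between the two rewards enters the loss, so the reward model's absolute scale is not pinned down. This is one reason raw reward scores are hard to compare across separately trained reward models.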
DPO is one of the most elegant simplifications in recent ML research. The RLHF pipeline requires training three separate models (SFT model, reward model, policy model) and running PPO, which is notoriously fiddly to tune. DPO showed that you can derive a closed-form loss that directly optimizes the same objective as RLHF, but using only supervised learning on preference pairs. No reward model. No PPO. No RL instability.
The trade-off: DPO is offline (it learns from a fixed dataset of preferences) while PPO is online (it generates new responses and gets fresh rewards). For simple, single-turn tasks, DPO often matches or beats PPO. For complex, multi-turn scenarios or when you need to iterate on the reward, PPO’s online nature gives it an edge.
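The DPO loss itself is short. The sketch below assumes each input is the summed log-probability of a whole response under the policy or the frozen reference model; the function name and the returned "implicit reward" diagnostics are illustrative choices, not an API from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss on a batch of preference pairs (sequence-level log-probs)."""
    # How much more (or less) likely each response is under the policy vs. the reference
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Logistic loss on the scaled margin between the two log-ratios
    logits = beta * (chosen_logratio - rejected_logratio)
    loss = -F.logsigmoid(logits).mean()
    # Implicit rewards, useful for monitoring the chosen/rejected margin
    chosen_reward = beta * chosen_logratio.detach()
    rejected_reward = beta * rejected_logratio.detach()
    return loss, chosen_reward, rejected_reward

ref_chosen = torch.tensor([-12.0, -9.0])
ref_rejected = torch.tensor([-11.0, -10.0])

# At initialization the policy equals the reference, so the margin is zero
loss_init, _, _ = dpo_loss(ref_chosen, ref_rejected, ref_chosen, ref_rejected)

# A policy that has shifted probability toward the chosen responses
loss_better, r_chosen, r_rejected = dpo_loss(
    ref_chosen + 2.0, ref_rejected - 2.0, ref_chosen, ref_rejected
)
print(loss_init.item(), loss_better.item(), (r_chosen - r_rejected).mean().item())
```

The reference log-probabilities can be computed once and cached, since the reference model never changes during training.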
The beta parameter is critical: In DPO, beta controls how strongly the optimized policy is tied to the reference model. It plays the same role as the KL coefficient in RLHF, so a larger beta means less deviation from the reference. Too high (e.g., 1.0) and the policy barely changes: safe but ineffective. Too low (e.g., 0.01) and the policy can drift far from the reference and overfit to the preference dataset, potentially losing general capabilities. Most practitioners find beta in the range 0.1-0.5 works well. Start at 0.1; lower it if the policy barely moves away from the reference, and raise it if general capabilities start to degrade.
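    class GRPO:
        def compute_group_advantages(self, rewards, group_size):
            # For each prompt, generate group_size responses
            # Compute advantages relative to group mean
            # No value function needed!
            # Assumes `rewards` is a flat tensor of shape (num_prompts * group_size,)
            # with the responses for each prompt stored contiguously.
            grouped = rewards.view(-1, group_size)
            mean = grouped.mean(dim=1, keepdim=True)
            std = grouped.std(dim=1, keepdim=True)
            return ((grouped - mean) / (std + 1e-8)).view(-1)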
Exercise 2: Reward Model Ensemble
Train an ensemble of reward models for more robust preferences:
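    class RewardEnsemble:
        def __init__(self, models):
            self.models = models

        def predict(self, x):
            # Average the reward predicted by each ensemble member
            rewards = [m(x) for m in self.models]
            return torch.stack(rewards).mean(dim=0)

        def uncertainty(self, x):
            # Use disagreement as uncertainty: here, the standard deviation
            # across members (one possible choice; needs at least two models)
            rewards = [m(x) for m in self.models]
            return torch.stack(rewards).std(dim=0)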
Exercise 3: Implement IPO
Implement Identity Preference Optimization:
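    # IPO loss is simpler than DPO
    def ipo_loss(chosen_logps, rejected_logps, ref_chosen, ref_rejected, beta=0.1):
        # beta is the regularization strength; the 0.1 default is an assumption
        # Log-ratios of policy to reference for each response
        h_chosen = chosen_logps - ref_chosen
        h_rejected = rejected_logps - ref_rejected
        # Regress the log-ratio margin onto the target 1 / (2 * beta)
        return ((h_chosen - h_rejected - 1 / (2 * beta)) ** 2).mean()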
Compare PPO-based RLHF with DPO. What are the trade-offs for a production alignment pipeline?
Strong Answer:
PPO-based RLHF is a three-stage pipeline: SFT, reward model training, then PPO optimization against the reward model with a KL penalty. DPO collapses the last two stages by directly optimizing the policy on preference pairs, treating the LM itself as an implicit reward model.
PPO advantages: it can optimize any reward signal, including pairwise preferences, rule-based rewards (toxicity filters), length penalties, or tool execution feedback. It supports online data collection during training for continuous improvement.
DPO advantages: dramatically simpler (no reward model, no value function, no PPO hyperparameter tuning), more stable (there is no separate reward model for the policy to exploit, although DPO can still overfit the preference data), and considerably cheaper in compute, since fewer models are held in memory and no responses are generated during training.
DPO limitations: it only trains on offline preference data, and it works best when that data was generated by a policy similar to the one being trained.
For production: start with DPO for simplicity. Switch to PPO only if you need online reward signals or external tool feedback.
Follow-up: How do you detect reward hacking during RLHF training?
Monitor three signals: (1) the reward score, which should plateau rather than climb indefinitely; (2) periodic human evaluation, because if reward rises but human ratings stall, that is reward hacking; (3) output length and repetition, since hacking often produces verbose, formulaic responses. Fixes include increasing the KL penalty, retraining the reward model with adversarial examples, or switching to DPO.
Explain the KL divergence penalty in RLHF. Why is it necessary?
Strong Answer:
The KL penalty constrains how far the RLHF policy drifts from the SFT reference. The objective: maximize E[R(x, y)] - beta * KL(pi_RL || pi_SFT).
Without it, the policy discovers shortcuts that score high with the imperfect reward model (extreme verbosity, sycophantic agreement, repetition) while losing coherent generation ability. The reward model is an imperfect proxy, and an unconstrained optimizer exploits every imperfection.
Beta too low: rapid divergence, degenerate high-reward outputs, mode collapse. Beta too high: barely any change from SFT, wasting the entire RLHF compute budget. Optimal beta is found by sweeping values and measuring human evaluation.
A key subtlety: KL is computed per-token, so the model can concentrate changes on specific tokens (like adding safety refusals) while keeping most output close to reference. This is actually desirable for alignment.
Follow-up: How does reference policy choice affect the result?
Using the raw pretrained model gives more freedom but risks losing SFT improvements. Using the SFT model (standard) preserves quality but limits deviation. Some teams use a mid-SFT checkpoint as reference for a balance between the two. The reference effectively defines the “center of gravity” that the KL penalty pulls toward.
A PM asks why you cannot just fine-tune on good examples instead of doing RLHF. How do you explain the value?
Strong Answer:
Sometimes SFT is enough. But RLHF provides three specific capabilities SFT cannot.
First, comparative judgments are far easier and cheaper to collect than demonstrations. “Which response is better?” is simpler than “write the perfect response.” For the same annotation budget you typically get far more data, and often more consistent labels.
Second, SFT suffers from mode averaging: two valid but different training responses to the same prompt get averaged into a bland compromise. RLHF learns to commit to one coherent style because the reward model can distinguish coherence from wishy-washy averaging.
Third, SFT optimizes token-by-token and cannot directly optimize holistic response properties like helpfulness, consistency, or the absence of self-contradiction. RLHF assigns a single reward to the full response, enabling sequence-level optimization.
For practical applications like customer service bots or structured extraction, SFT alone is often sufficient and dramatically simpler. RLHF shines when you need nuanced alignment that demonstrations cannot capture.
Follow-up: How does Constitutional AI reduce the annotation burden?
Constitutional AI replaces human preference labeling with AI-generated feedback. Define principles, have the model critique its own response pairs against those principles, and train the reward model on AI-generated preferences. This leverages the observation that large models are often better at judging responses than at generating them, which can substantially reduce annotation costs while still producing well-aligned models.