A language model trained on the internet can write poetry, generate code, and answer trivia. But ask it a simple question like "How do I pick a lock?" and it will cheerfully explain, because its training objective was to predict the next token, not to be helpful and safe.
The gap between "good at predicting text" and "good at following human intent" is the alignment problem. Closing that gap is what RLHF (Reinforcement Learning from Human Feedback) is about, and it's arguably the most consequential technique in modern AI. It transformed GPT-3 (impressive but unreliable) into ChatGPT (useful and mostly safe). It turned a text completion engine into an assistant.
This post walks through the entire RLHF pipeline, from the math to the code. We'll cover:
- Why pre-training alone isn't enough
- Supervised Fine-Tuning (SFT): learning from demonstrations
- Reward Modeling: learning human preferences
- PPO: the RL optimization loop
- DPO: skipping the reward model entirely
- Constitutional AI: using AI feedback instead of human feedback
- The practical challenges that make alignment hard
Let's begin.
Part I: The Alignment Problem
What pre-training actually optimizes
A pre-trained language model maximizes the log-likelihood of the training corpus:

$$\mathcal{L}_{\text{pretrain}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{t} \log \pi_\theta(x_t \mid x_{<t}) \right]$$
This objective has no notion of "helpful," "harmless," or "honest." The model learns to be a statistical parrot of its training data. If the training data contains toxic content, the model reproduces toxic content. If the data contains contradictions, the model contradicts itself. The probability of a response is determined by how likely it is to appear on the internet, not by how good it is as an answer.
Three failure modes
1. Helpfulness failure. The model might respond to "Write me a Python function to sort a list" with a Wikipedia article about sorting algorithms instead of actual code. Both are plausible continuations of the text; the model has no preference for the useful one.
2. Safety failure. The model might provide detailed instructions for dangerous activities. The internet contains such information, so the model learned to produce it.
3. Honesty failure. The model might hallucinate facts with perfect confidence. It was never trained to say "I don't know"; that phrase rarely appears in its training data as a response to questions.
The solution: learn from human feedback
The insight behind RLHF is simple: if we can't write down a perfect loss function for "good behavior," we can train a model to approximate human judgment, then optimize against it.
This happens in three stages: supervised fine-tuning on demonstrations, reward modeling from preference comparisons, and RL optimization against the reward model.
The pipeline was introduced by Ouyang et al. (2022) in the InstructGPT paper and subsequently used (with variations) by ChatGPT, Claude, Gemini, and nearly every major language model deployed today.
Part II: Supervised Fine-Tuning (SFT)
The first alignment step
Before any RL happens, we need a model that can at least follow basic instructions. SFT achieves this by fine-tuning the pre-trained model on a dataset of (prompt, high-quality response) pairs written by human demonstrators.
The loss function is identical to pre-training (next-token prediction), but the data is curated:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{demo}}} \left[ \sum_{t} \log \pi_\theta(y_t \mid x, y_{<t}) \right]$$

where $x$ is the prompt and $y$ is the human-written demonstration. The key difference from pre-training: the data consists of examples that represent the desired behavior, not just any text from the internet.
What SFT teaches
SFT teaches the model:
- Format: How to structure responses (use headers, bullet points, code blocks)
- Tone: Be conversational, helpful, and direct
- Task compliance: Actually answer the question being asked
- Safety basics: Refuse clearly harmful requests
SFT in code
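The whole SFT stage reduces to masked next-token prediction: cross-entropy computed only over the response tokens, with prompt positions masked out. Below is a minimal numpy sketch of that loss, with toy logits standing in for a real transformer; the `sft_loss` helper and all shapes are illustrative assumptions, not a production implementation.

```python
import numpy as np

def sft_loss(logits, tokens, prompt_len):
    """Next-token cross-entropy over the response tokens only.

    logits:     (seq_len, vocab) scores; logits[t] predicts tokens[t + 1].
    tokens:     (seq_len,) token ids for the concatenated prompt + response.
    prompt_len: number of prompt tokens; positions whose target is still
                part of the prompt are masked out of the loss.
    """
    # log-softmax over the vocabulary (stabilized by subtracting the max)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    total, count = 0.0, 0
    for t in range(len(tokens) - 1):
        if t + 1 < prompt_len:      # target is a prompt token: no loss signal
            continue
        total -= log_probs[t, tokens[t + 1]]
        count += 1
    return total / count

# Toy example: 5 tokens (2 prompt + 3 response), vocabulary of 8.
rng = np.random.default_rng(0)
tokens = np.array([1, 2, 3, 4, 5])
random_logits = rng.normal(size=(5, 8))
print(sft_loss(random_logits, tokens, prompt_len=2))  # roughly -log(1/8) for random logits
```

Masking the prompt is a common convention: the model should learn to produce the demonstration, not to reproduce the prompt.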
The limits of SFT
SFT has a fundamental ceiling: it can only be as good as the demonstrations. Human demonstrators disagree on what constitutes a "good" response. Some write verbose answers; others prefer concise ones. Some are experts; others make mistakes.
More critically, SFT trains on a binary signal: this response is good enough to include in the dataset. It cannot express degrees of quality. It cannot say "this response is good, but this other one is better." For that, we need a reward model.
Part III: Reward Modeling
The preference learning problem
Instead of asking humans to write perfect responses (expensive and inconsistent), we ask a much easier question: given two responses, which one is better?
This is the core insight of reward modeling. Pairwise comparisons are:
- Cheaper: Comparing takes seconds; writing takes minutes
- More consistent: Humans agree more on relative quality than absolute quality
- More scalable: One annotator can label 50+ comparisons per hour
The data format
Each training example is a triple $(x, y_w, y_l)$:
- $x$: the prompt
- $y_w$: the preferred (winning) response
- $y_l$: the dispreferred (losing) response
The responses typically come from the SFT model itself. We sample multiple responses per prompt, then have humans rank them.
The Bradley-Terry model
We model preferences using the Bradley-Terry model, a classic framework from the 1950s originally developed for sports rankings. The probability that response $y_w$ is preferred over $y_l$ is:

$$P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$$

where $\sigma$ is the sigmoid function and $r_\phi(x, y)$ is a scalar reward that the model assigns to response $y$ given prompt $x$.
The intuition: if the reward model assigns a much higher score to $y_w$ than to $y_l$, the sigmoid pushes the probability close to 1 (high confidence that $y_w$ is better). If the scores are close, the probability is near 0.5 (uncertain).
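To make the intuition concrete, here is the Bradley-Terry probability computed for a couple of score pairs (the numbers are chosen arbitrarily for illustration):

```python
import math

def preference_prob(r_w, r_l):
    """Bradley-Terry: probability that the response scored r_w wins."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

print(preference_prob(2.0, 0.5))   # clear margin of 1.5: ~0.82
print(preference_prob(1.1, 1.0))   # near-tie, margin 0.1: ~0.52
```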
The reward model loss
We train the reward model by maximizing the log-likelihood of the observed preferences:

$$\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big) \Big]$$

This is a binary cross-entropy loss. When the reward model correctly assigns a higher reward to the preferred response (large positive margin $r_\phi(x, y_w) - r_\phi(x, y_l)$), the loss is small. When it gets the ranking wrong, the loss is large.
Reward model architecture
The reward model is typically initialized from the SFT model. The only architectural change: replace the language modeling head (which outputs a vocabulary-sized vector) with a scalar head (which outputs a single number).
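In code terms the change is a single projection: instead of mapping the final hidden state to vocabulary logits, map it to one number. A minimal numpy illustration (names and shapes here are assumptions, standing in for a transformer backbone):

```python
import numpy as np

def scalar_reward(hidden_states, head_weights):
    """hidden_states: (seq_len, d_model) activations from the backbone.
    Instead of projecting to vocab-sized logits, project the final
    token's hidden state down to a single scalar reward."""
    return float(hidden_states[-1] @ head_weights)

rng = np.random.default_rng(0)
h = rng.normal(size=(12, 64))   # e.g. 12 tokens, d_model = 64
w = rng.normal(size=64) / 8.0   # the new scalar head
print(scalar_reward(h, w))
```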
Training the reward model
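Training then reduces to minimizing the Bradley-Terry loss. The numpy sketch below trains a linear scalar head over fixed features with plain gradient descent; the linear head, synthetic features, and learning rate are toy stand-ins for fine-tuning a transformer whose LM head has been replaced by a scalar head.

```python
import numpy as np

def rm_loss(w, feats_w, feats_l):
    """Bradley-Terry loss for a linear reward head r(x, y) = features @ w."""
    margin = feats_w @ w - feats_l @ w            # r(x, y_w) - r(x, y_l)
    return np.mean(np.logaddexp(0.0, -margin))    # -log sigmoid(margin), stable

def rm_grad(w, feats_w, feats_l):
    """Gradient of rm_loss with respect to the head weights w."""
    margin = feats_w @ w - feats_l @ w
    coef = 1.0 / (1.0 + np.exp(-margin)) - 1.0    # sigmoid(margin) - 1
    return (coef[:, None] * (feats_w - feats_l)).mean(axis=0)

# Synthetic preference data: chosen responses are shifted along a hidden
# "quality" direction, so a separating head exists.
rng = np.random.default_rng(1)
dim, n = 16, 256
quality = rng.normal(size=dim)
feats_l = rng.normal(size=(n, dim))
feats_w = feats_l + 0.5 * quality + 0.1 * rng.normal(size=(n, dim))

w = np.zeros(dim)
for step in range(200):                           # plain gradient descent
    w -= 0.5 * rm_grad(w, feats_w, feats_l)

print(rm_loss(np.zeros(dim), feats_w, feats_l))   # log(2) ≈ 0.693 before training
print(rm_loss(w, feats_w, feats_l))               # much lower after training
```

Before training, the margin is zero on every pair and the loss is exactly $\log 2$; as the head aligns with the quality direction, the margins grow and the loss falls.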
Reward model quality matters enormously
The reward model is the entire source of training signal for the RL phase. If the reward model has systematic biases (e.g., it prefers longer responses regardless of quality), the policy will exploit those biases. This is the root cause of reward hacking, which we'll discuss later.
In practice, InstructGPT's reward model achieved about 72% agreement with held-out human labels. That's far from perfect, but sufficient to guide useful RL training. The accuracy varies significantly by category: factual questions are easier to judge than creative writing.
Part IV: PPO - The RL Training Loop
The optimization objective
With a trained reward model $r_\phi$, we can now optimize the language model's policy $\pi_\theta$ to produce responses that score highly. The objective is:

$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big] \;-\; \beta \, \mathbb{D}_{\text{KL}}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big)$$
Two terms:
- Reward maximization: Generate responses that the reward model scores highly
- KL penalty: Don't stray too far from the reference policy $\pi_{\text{ref}}$ (the SFT model)

The coefficient $\beta$ controls the tradeoff. Too small, and the model exploits reward model bugs. Too large, and the model barely changes from SFT.
Why KL divergence?
Without the KL penalty, the policy would find degenerate solutions: responses that trick the reward model into giving high scores without actually being good. For example:
- Repeating the word "great" hundreds of times (some reward models score this highly)
- Producing responses in a bizarre format that happens to score well
- Mode-collapsing to a single "template" response for all prompts
The KL divergence between the two distributions over token sequences is:

$$\mathbb{D}_{\text{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \left[ \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \right]$$

In practice, for autoregressive models, this decomposes into a per-token KL:

$$\log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} = \sum_t \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{ref}}(y_t \mid x, y_{<t})}$$
PPO: Proximal Policy Optimization
PPO (Schulman et al., 2017) is the standard algorithm used for the RL phase. It's popular because it's relatively stable and sample-efficient compared to other policy gradient methods.
The core idea: update the policy to increase the probability of actions with positive advantage, but clip the update to prevent catastrophically large changes.
The policy gradient
The basic policy gradient theorem says:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \hat{A}_t \big]$$

where $\hat{A}_t$ is the advantage, measuring how much better action $a_t$ was compared to the expected value at state $s_t$. In the RLHF context:
- "State" is the prompt plus tokens generated so far
- "Action" is the next token chosen
- "Advantage" comes from the reward model score and a learned value function
The clipped objective
Vanilla policy gradients can take steps that are too large, destroying the policy. PPO prevents this with a clipped objective:

$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t \Big[ \min\big( r_t(\theta) \, \hat{A}_t, \;\; \text{clip}\big(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big) \, \hat{A}_t \big) \Big]$$

where $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio between the new and old policies, and $\epsilon$ is the clipping parameter (typically 0.2).

What the clipping does: If the advantage is positive (good action), we want to increase the probability ratio $r_t(\theta)$, but the objective stops rewarding increases beyond $1 + \epsilon$. If the advantage is negative (bad action), we want to decrease $r_t(\theta)$, but the objective stops rewarding decreases beyond $1 - \epsilon$. This prevents any single update from changing the policy too drastically.
The full PPO-RLHF objective
Combining the reward, KL penalty, and PPO clipping, the per-token reward used by PPO is:

$$\tilde{r}_t = r_\phi(x, y)\,\mathbb{1}[t = T] \;-\; \beta \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{ref}}(y_t \mid x, y_{<t})}$$

This modified reward is what PPO maximizes: the reward model's score arrives at the final token $T$, while the per-token KL penalty acts as a regularizer at every generation step, not just at the sequence level.
PPO implementation
Here's a simplified but complete PPO training step for language model alignment:
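The numpy sketch below, with illustrative names and shapes, implements the two computations specific to RLHF-PPO: the KL-shaped per-token reward and the clipped policy loss. Generation, the value network, and advantage estimation (normally GAE) are omitted.

```python
import numpy as np

def shaped_rewards(rm_score, logp_policy, logp_ref, beta=0.1):
    """Per-token rewards: -beta * per-token KL at every step, with the
    reward model's sequence-level score added at the final token."""
    rewards = -beta * (logp_policy - logp_ref)
    rewards[-1] += rm_score
    return rewards

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO policy loss (to be minimized) over a batch of tokens."""
    ratio = np.exp(logp_new - logp_old)                      # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))

# One response of 4 tokens: per-token log-probs under the current policy
# and the frozen reference, plus a reward-model score for the sequence.
logp_policy = np.array([-1.2, -0.8, -2.0, -0.5])
logp_ref    = np.array([-1.0, -1.0, -1.5, -0.6])
rewards = shaped_rewards(rm_score=1.5, logp_policy=logp_policy, logp_ref=logp_ref)

# In a real implementation, advantages come from GAE with the value
# network; here we reuse the shaped rewards as a crude stand-in.
loss = ppo_clip_loss(logp_policy, logp_old=logp_policy - 0.1, advantages=rewards)
print(rewards, loss)
```

Note how the sign flip makes `ppo_clip_loss` a loss: PPO maximizes the clipped objective, so a gradient-descent optimizer minimizes its negative.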
The four models in memory
During PPO training, four models must be loaded simultaneously:
| Model | Role | Trainable? |
|---|---|---|
| Policy | Generates responses; being optimized | Yes |
| Reference | Anchors the KL penalty | No (frozen) |
| Reward model | Scores response quality | No (frozen) |
| Value model | Estimates expected future reward | Yes |
For a 7B parameter model, each copy requires ~14 GB in fp16. Four copies means ~56 GB just for model weights, before accounting for activations and gradients. This is why RLHF training typically requires multiple high-end GPUs.
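The arithmetic behind those numbers, in case you want to redo it for other model sizes (fp16/bf16 stores each parameter in 2 bytes):

```python
def fp16_gigabytes(n_params):
    """Weight memory in GB at 2 bytes per parameter (fp16/bf16)."""
    return n_params * 2 / 1e9

print(fp16_gigabytes(7e9))       # 14.0 GB per model copy
print(4 * fp16_gigabytes(7e9))   # 56.0 GB for policy + reference + RM + value
```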
Part V: DPO - Direct Preference Optimization
The key insight
Rafailov et al. (2023) asked a brilliant question: what if we could skip the reward model and RL entirely?
The answer comes from a mathematical observation. The optimal policy under the RLHF objective (reward maximization + KL penalty) has a closed-form solution:

$$\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y \mid x) \exp\!\left( \frac{1}{\beta} r(x, y) \right)$$

where $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \exp\!\left( \frac{1}{\beta} r(x, y) \right)$ is the partition function.
Deriving the DPO loss
This closed-form solution means we can solve for the reward in terms of the optimal policy:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$
Now substitute this into the Bradley-Terry preference model. The partition function $Z(x)$ cancels out (it depends only on the prompt, not the response):

$$P(y_w \succ y_l \mid x) = \sigma\!\left( \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right)$$

Replacing the unknown optimal policy $\pi^*$ with the trainable policy $\pi_\theta$ and maximizing the likelihood of the observed preferences gives us the DPO loss:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$
What DPO actually computes
The term $\hat{r}_\theta(x, y) = \beta \log \dfrac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ is the implicit reward. It measures how much the current policy has diverged from the reference on this specific response. If $\pi_\theta$ assigns much higher probability to $y$ than $\pi_{\text{ref}}$ does, the implicit reward is high.
The DPO loss says: increase the implicit reward for preferred responses and decrease it for dispreferred ones. The KL constraint is baked in automatically, because the implicit reward is defined relative to the reference policy.
RLHF vs DPO: side by side

| | RLHF (with PPO) | DPO |
|---|---|---|
| Reward signal | Explicit reward model trained on preferences | Implicit reward from policy/reference log-ratios |
| Optimization | RL loop (sampling + PPO updates) | Supervised loss on preference pairs |
| Models in memory | 4 (policy, reference, reward, value) | 2 (policy, reference) |
| KL control | Explicit penalty with coefficient $\beta$ | Baked into the loss via the reference policy |
DPO implementation
DPO is dramatically simpler to implement than PPO:
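A numpy sketch of the loss, assuming you already have the summed (sequence-level) log-probabilities of each response under the policy and the frozen reference; the function and variable names are illustrative:

```python
import numpy as np

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from sequence-level (summed) log-probs of the chosen (w)
    and rejected (l) responses under the policy and reference models."""
    implicit_w = beta * (policy_logp_w - ref_logp_w)   # implicit reward, chosen
    implicit_l = beta * (policy_logp_l - ref_logp_l)   # implicit reward, rejected
    margin = implicit_w - implicit_l
    return np.mean(np.logaddexp(0.0, -margin))         # -log sigmoid(margin)

# At initialization the policy equals the reference, so both implicit
# rewards are zero and the loss is exactly log(2).
z = np.zeros(4)
print(dpo_loss(z, z, z, z))  # 0.6931...

# As the policy raises the chosen responses' log-probs relative to the
# reference, the margin grows and the loss falls.
print(dpo_loss(z + 5.0, z, z, z))
```

Note that this is an ordinary supervised training step: no sampling, no value function, no reward model forward pass.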
DPO advantages and limitations
Advantages:
- No reward model to train or maintain
- No RL instabilities (PPO is notoriously finicky)
- ~50% less GPU memory (2 models instead of 4)
- Simpler codebase; uses a standard supervised training loop
- The loss is well-defined and easy to debug
Limitations:
- Requires preference data to be representative of the desired behavior
- Less flexible than RLHF with an explicit reward model (can't easily do online learning)
- The $\beta$ parameter is sensitive; too small leads to mode collapse, too large leads to no learning
- Cannot leverage reward models for data filtering or best-of-N sampling at inference time
- Some evidence that PPO produces stronger results at very large scale
Variants of DPO
Several follow-up works have improved on the original DPO formulation:
IPO (Identity Preference Optimization): Replaces the sigmoid loss with a squared loss that avoids the overfitting problems of DPO when the preference data is deterministic.
KTO (Kahneman-Tversky Optimization): Doesn't require pairwise comparisons at all. Instead, it works with individual examples labeled as "good" or "bad." Based on prospect theory from behavioral economics.
ORPO (Odds Ratio Preference Optimization): Combines SFT and preference optimization into a single stage by adding a preference penalty directly to the SFT loss.
Part VI: Constitutional AI & RLAIF
The human bottleneck
Human feedback is expensive and slow. A single comparison label costs $0.50-2.00 and requires trained annotators. Training a strong reward model needs 50,000-100,000+ comparisons. This creates a bottleneck: the model can only be as aligned as the budget allows.
Using AI feedback
Bai et al. (2022) proposed Constitutional AI (CAI), which replaces human feedback with AI feedback in two stages:
Stage 1: Self-Critique and Revision. Given a harmful response, ask the model to critique its own response against a set of principles (the "constitution"), then revise it.
Stage 2: RLAIF (RL from AI Feedback). Instead of human annotators comparing responses, use a language model to compare them. The AI evaluator is prompted with the constitutional principles and asked which response better adheres to them.
The constitution
The "constitution" is a set of principles like:
- Be helpful, harmless, and honest
- Don't assist with illegal activities
- Acknowledge uncertainty rather than hallucinating
- Respect privacy and consent
- Avoid generating explicit or violent content
These principles are provided to the AI critic as part of its system prompt. The specific principles can be updated without retraining; just modify the critic's prompt.
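Concretely, assembling the critic's prompt is plain string construction. A hypothetical sketch (the template wording and function name are invented for illustration):

```python
CONSTITUTION = [
    "Be helpful, harmless, and honest",
    "Don't assist with illegal activities",
    "Acknowledge uncertainty rather than hallucinating",
]

def build_critic_prompt(prompt, response_a, response_b):
    """Assemble the comparison prompt given to the AI labeler (RLAIF)."""
    principles = "\n".join(f"- {p}" for p in CONSTITUTION)
    return (
        f"Principles:\n{principles}\n\n"
        f"User prompt: {prompt}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response better adheres to the principles? Answer A or B."
    )

print(build_critic_prompt("How do locks work?", "...", "..."))
```

Updating the constitution is an edit to `CONSTITUTION`, not a retraining run, which is exactly the flexibility the text describes.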
RLAIF in practice
The RLAIF pipeline is nearly identical to RLHF, with one substitution:
| Step | RLHF | RLAIF |
|---|---|---|
| 1. SFT | Human demonstrations | Human demonstrations |
| 2. Preferences | Human comparisons | AI comparisons |
| 3. Reward model | Trained on human prefs | Trained on AI prefs |
| 4. RL | PPO against reward model | PPO against reward model |
Google's research showed that RLAIF achieves comparable performance to RLHF on many benchmarks, and can even exceed it when the AI labeler is sufficiently capable. The key finding: AI preferences are more consistent (less noisy) than human preferences, which can actually lead to a better reward model.
Self-play and iterated refinement
A natural extension: use the aligned model to generate AI feedback, then use that feedback to train an even more aligned model, then repeat. This iterated RLAIF process can bootstrap from a weak initial model to increasingly capable alignment, though it risks amplifying any systematic biases in the AI evaluator.
Part VII: Practical Challenges
Reward hacking
The most pernicious problem in RLHF. Reward hacking occurs when the policy finds inputs that score highly according to the reward model but are not actually good responses.
Common examples:
- Length gaming: The reward model slightly prefers longer responses, so the policy generates extremely verbose answers with padding and repetition
- Style exploitation: The model learns to use confident, authoritative language regardless of whether the content is correct
- Sycophancy: The model agrees with whatever the user says, even when the user is wrong, because agreement tends to score higher
- Formatting tricks: Excessive use of bullet points, headers, or markdown that reward models rate highly
The KL penalty mitigates but doesn't eliminate reward hacking. The fundamental issue is that the reward model is an imperfect proxy for human judgment. Any optimization against an imperfect proxy will eventually find the gaps.
Goodhart's Law applies directly: "When a measure becomes a target, it ceases to be a good measure."
Mitigation strategies:
- Ensemble reward models: Use multiple reward models and take the minimum score
- Reward model regularization: Penalize extreme reward values
- KL budget: Set a hard limit on KL divergence, not just a soft penalty
- Periodic re-labeling: Have humans evaluate the policy's outputs and retrain the reward model
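The ensemble mitigation above is cheap to express in code: score with several reward models and take the pessimistic minimum. A sketch with stand-in scorer functions:

```python
def ensemble_reward(scorers, prompt, response):
    """Pessimistic ensemble: a response only scores highly if every
    reward model in the ensemble agrees it is good."""
    return min(s(prompt, response) for s in scorers)

# Stand-in reward models that disagree about a length-gamed response.
rm_a = lambda p, r: 0.9   # fooled by verbosity
rm_b = lambda p, r: 0.2   # not fooled
print(ensemble_reward([rm_a, rm_b], "prompt", "very long padded response"))  # 0.2
```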
Mode collapse
The policy may converge to producing the same response (or a small set of responses) for every prompt. This happens when the reward model strongly prefers one style and the KL penalty is insufficient to maintain diversity.
Symptoms:
- Entropy of the policy drops to near zero
- All responses start with the same preamble
- The model ignores prompt variations
Solutions:
- Increase the KL coefficient
- Add an entropy bonus to the PPO objective
- Use diverse prompt batches during training
- Monitor generation diversity as a training metric
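Monitoring diversity can start as simply as tracking the entropy of the policy's next-token distribution during training, as in this small sketch:

```python
import numpy as np

def policy_entropy(probs):
    """Shannon entropy of a next-token distribution; near zero signals collapse."""
    p = np.asarray(probs)
    p = p[p > 0]                      # 0 * log(0) is defined as 0
    return float(-(p * np.log(p)).sum())

print(policy_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform: log(4) ≈ 1.386
print(policy_entropy([0.999, 0.001, 0.0, 0.0]))  # near-collapsed: close to 0
```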
Evaluation difficulties
How do you know if alignment is working? Unlike classification accuracy or perplexity, there's no single number that captures "alignment quality."
Common evaluation approaches:
Human evaluation: The gold standard but expensive and slow. Typically done as A/B tests: show humans outputs from two models and ask which is better. Requires careful calibration to avoid annotator bias.
Automated benchmarks:
- MT-Bench: Multi-turn conversation quality judged by GPT-4
- AlpacaEval: Single-turn instruction following with automated length-controlled win rates
- TruthfulQA: Tests for hallucination on adversarial questions
- BBQ: Measures social biases across demographic groups
Red teaming: Adversarial probing by humans (or other models) to find failure modes. Essential but difficult to systematize.
The fundamental challenge: alignment is multi-dimensional. A model can be helpful but unsafe. Safe but unhelpful. Honest but harsh. Improving one dimension often trades off against another.
The RLHF tax
RLHF typically reduces the model's raw capabilities (as measured by benchmarks like MMLU or HumanEval) while improving its instruction-following and safety. This is sometimes called the "alignment tax." The magnitude varies but is typically 1-5% on capability benchmarks.
This tradeoff is generally considered worthwhile: a slightly less capable model that follows instructions and avoids harm is far more useful in practice than a more capable model that ignores instructions and generates harmful content.
Part VIII: Putting It All Together
The complete training recipe
Here is the full pipeline for training an aligned language model, as practiced at major labs:
Phase 0: Pre-training
- Train on trillions of tokens of internet text
- Objective: next-token prediction
- Duration: weeks to months on thousands of GPUs
- Output: base model (e.g., GPT-4-base, Llama-base)
Phase 1: Supervised Fine-Tuning
- Dataset: 10,000-100,000 (prompt, response) pairs from expert annotators
- Objective: next-token prediction on demonstrations
- Duration: hours to days
- Output: SFT model
Phase 2: Reward Model Training
- Dataset: 50,000-500,000 pairwise comparisons
- Objective: Bradley-Terry preference loss
- Duration: hours
- Output: Reward model
Phase 3: RL Optimization
- Algorithm: PPO or DPO
- Objective: Maximize reward with KL constraint
- Duration: days
- Output: Aligned model
Phase 4: Safety Fine-Tuning
- Additional round of RLHF/DPO focused specifically on safety
- Red team evaluation and iterative refinement
- Constitutional AI principles for scalable oversight
What's next
The field is moving rapidly. Some active research directions:
Process reward models: Instead of scoring complete responses, score each step of reasoning individually. This provides denser training signal and catches errors earlier in the chain of thought.
RLHF at scale: As models get larger, the computational cost of RLHF grows. Research into more efficient algorithms (like online DPO variants) is ongoing.
Multi-objective alignment: Current RLHF collapses all human preferences into a single scalar reward. Future systems may maintain separate reward models for helpfulness, safety, honesty, and other dimensions, then use multi-objective optimization.
Scalable oversight: As models become more capable than their human evaluators, how do we ensure the feedback is still meaningful? This is one of the deepest open problems in AI safety.
Summary
RLHF and its variants are the bridge between "language model" and "AI assistant." The math is elegant (the Bradley-Terry model, the closed-form DPO solution), but the engineering is where the real difficulty lies. Reward hacking, mode collapse, evaluation, and scalability remain active challenges.
The key equations to remember:

Reward model loss:

$$\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E} \big[ \log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big) \big]$$

RLHF objective:

$$\max_\theta \; \mathbb{E}_{y \sim \pi_\theta} \big[ r_\phi(x, y) \big] - \beta \, \mathbb{D}_{\text{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)$$

DPO loss:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$

PPO clipped objective:

$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t \Big[ \min\big( r_t(\theta) \, \hat{A}_t, \;\; \text{clip}\big(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big) \, \hat{A}_t \big) \Big]$$
The path from raw language model to aligned assistant is long, expensive, and imperfect. But it works. And it's getting better.
References
- Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT, 2022)
- Schulman et al., "Proximal Policy Optimization Algorithms" (PPO, 2017)
- Rafailov et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (DPO, 2023)
- Bai et al., "Constitutional AI: Harmlessness from AI Feedback" (CAI, 2022)
- Christiano et al., "Deep reinforcement learning from human preferences" (2017)
- Stiennon et al., "Learning to summarize from human feedback" (2020)
- Ziegler et al., "Fine-Tuning Language Models from Human Preferences" (2019)
- Bradley & Terry, "Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons" (1952)
- Azar et al., "A General Theoretical Paradigm to Understand Learning from Human Feedback" (IPO, 2023)
- Ethayarajh et al., "KTO: Model Alignment as Prospect Theoretic Optimization" (2024)
- Hong et al., "ORPO: Monolithic Preference Optimization without Reference Model" (2024)
