What Is RLHF?
Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique used to align the behavior of large language models with human preferences. Rather than training purely on text prediction, RLHF incorporates a signal from human raters who compare pairs of model outputs and indicate which they prefer.
RLHF is the key technique behind the transformation of raw language models into helpful, harmless, and honest assistants — it is central to how models like ChatGPT, Claude, and Gemini are trained.
The Three-Stage RLHF Pipeline
Stage 1: Supervised Fine-tuning (SFT)
A base pre-trained model is fine-tuned on a dataset of high-quality prompt/response pairs written or curated by human labelers. This teaches the model the general format and tone of helpful responses.
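Under the hood, SFT is ordinary next-token training with a cross-entropy loss on the curated responses. A minimal, framework-free sketch of that loss at one toy vocabulary position (the logits here are illustrative; real SFT operates over the model's full vocabulary and batches of prompt/response pairs):

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target token under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]

# Toy 3-token vocabulary; SFT minimizes this loss for every response
# token given the preceding prompt + response prefix.
logits_for_next_token = [2.0, 0.5, -1.0]
loss = cross_entropy(logits_for_next_token, target_index=0)
```

Because the correct token already has the highest logit, the loss is small; averaging this quantity over all response tokens gives the SFT training objective.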
Stage 2: Reward Model Training
Human labelers are shown multiple model-generated responses to the same prompt and rank them from best to worst. These rankings train a separate reward model — a model that maps any (prompt, response) pair to a scalar score, trained so that responses humans preferred receive higher scores than those they rejected.
Prompt: "Explain recursion to a 10-year-old."
Response A: [technical jargon explanation]
Response B: [story-based analogy] → labeler prefers B
The reward model generalizes from thousands of such comparisons.
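A minimal sketch of how such rankings become a training signal, using the pairwise Bradley-Terry-style loss commonly used for reward models (the scalar scores below are illustrative stand-ins for reward-model outputs):

```python
import math

def pairwise_loss(score_preferred, score_rejected):
    """Bradley-Terry preference loss: -log sigmoid(r_preferred - r_rejected).
    Low when the reward model already scores the preferred response higher."""
    margin = score_preferred - score_rejected
    return math.log(1 + math.exp(-margin))

# Reward model agrees with the labeler -> small loss.
loss_good = pairwise_loss(score_preferred=2.0, score_rejected=-1.0)
# Reward model ranks the pair the wrong way round -> large loss.
loss_bad = pairwise_loss(score_preferred=-1.0, score_rejected=2.0)
```

Minimizing this loss over thousands of labeled comparisons is what teaches the reward model to generalize the human preference signal.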
Stage 3: Policy Optimization with PPO
The SFT model (now called the policy) is updated using Proximal Policy Optimization (PPO), a reinforcement learning algorithm. The policy generates responses, the reward model scores them, and PPO adjusts the policy's weights to maximize expected reward — while a KL-divergence penalty prevents the model from drifting too far from the SFT baseline, which would otherwise degrade fluency and invite reward hacking.
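A minimal sketch of the KL-shaped reward this stage optimizes (one common formulation, not the only one; the log-probabilities below are illustrative):

```python
import math

def shaped_reward(rm_score, logp_policy, logp_sft, kl_coef=0.1):
    """Reward used in RLHF-style PPO: the reward model's score minus a
    KL penalty that keeps the policy close to the SFT baseline."""
    kl_estimate = logp_policy - logp_sft  # per-token log-prob ratio
    return rm_score - kl_coef * kl_estimate

# Same reward-model score, but the second policy assigns its tokens much
# higher probability than the SFT model does (it has drifted), so its
# effective reward is lower.
r_close = shaped_reward(1.0, logp_policy=-0.5, logp_sft=-0.5)
r_drifted = shaped_reward(1.0, logp_policy=-0.5, logp_sft=-2.5)
```

The coefficient `kl_coef` tunes the trade-off: higher values keep the policy conservative, lower values let it chase the reward model harder (and hack it sooner).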
Why RLHF Is Difficult
- Reward hacking — Models learn to game the reward model by producing responses that score well but are not genuinely helpful (e.g., verbose flattery).
- Labeler disagreement — Different humans have different values, leading to noisy preference signals.
- Scalability — Human annotation is expensive and slow; scaling to billions of comparisons is infeasible.
- Mode collapse — PPO can cause the model to converge on a narrow range of response styles.
Alternatives and Successors
| Method | Key idea |
|---|---|
| DPO (Direct Preference Optimization) | Eliminates the explicit reward model; trains directly on preference pairs. Simpler and more stable. |
| RLAIF (RL from AI Feedback) | Uses another LLM (e.g., Claude) as the rater instead of humans, reducing cost. |
| Constitutional AI (CAI) | Anthropic's approach: the model critiques and revises its own outputs using a set of principles. |
| GRPO | Group Relative Policy Optimization — used in DeepSeek-R1; estimates advantages by comparing each response against others in a sampled group, eliminating the separate value model. |
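Of these, DPO has the simplest training objective: it applies a preference loss directly to the policy, with no reward model or RL loop. A minimal sketch of its per-pair loss (Rafailov et al., 2023), using toy summed log-probabilities that are illustrative only:

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * [(pi_w - pi_l) - (ref_w - ref_l)])
    where each argument is a full response's summed log-probability
    under the policy (pi_*) or the frozen reference model (ref_*)."""
    margin = beta * ((pi_w - pi_l) - (ref_w - ref_l))
    return math.log(1 + math.exp(-margin))

# A policy that separates chosen from rejected more than the reference
# does gets a lower loss than one that has the ordering backwards.
loss_improving = dpo_loss(pi_w=-1.0, pi_l=-3.0, ref_w=-2.0, ref_l=-2.0)
loss_regressing = dpo_loss(pi_w=-3.0, pi_l=-1.0, ref_w=-2.0, ref_l=-2.0)
```

The `beta` parameter plays the same role as the KL coefficient in PPO-based RLHF: it controls how far the policy may move from the reference model in pursuit of the preference signal.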
RLHF and Knowledge Quality
RLHF shapes how a model responds — its tone, safety, and helpfulness. But it does not update what the model knows. An RLHF-aligned model will politely and helpfully hallucinate outdated product information just as readily as an unaligned one.
Grounding aligned models with accurate, fresh knowledge — using tools like KnowledgeSDK to extract and index structured web content — is the complementary layer that addresses the knowledge gap RLHF cannot fix.
Key Papers
- "Training language models to follow instructions with human feedback" — Ouyang et al., 2022 (InstructGPT)
- "Learning to summarize from human feedback" — Stiennon et al., 2020
- "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" — Rafailov et al., 2023