What Is RLHF?
Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique used to align the behavior of large language models with human preferences. Rather than training purely on text prediction, RLHF incorporates a signal from human raters who compare pairs of model outputs and indicate which they prefer.
RLHF is the key technique behind the transformation of raw language models into helpful, harmless, and honest assistants — it is central to how models like ChatGPT, Claude, and Gemini are trained.
The Three-Stage RLHF Pipeline
Stage 1: Supervised Fine-tuning (SFT)
A base pre-trained model is fine-tuned on a dataset of high-quality prompt/response pairs written or curated by human labelers. This teaches the model the general format and tone of helpful responses.
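Under the hood, SFT is ordinary next-token training with a cross-entropy loss on the curated responses. A minimal, framework-free sketch of that loss at one toy vocabulary position (the logits here are illustrative; real SFT operates over the model's full vocabulary and batches of prompt/response pairs):

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target token under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]

# Toy 3-token vocabulary; SFT minimizes this loss for every response
# token given the preceding prompt + response prefix.
logits_for_next_token = [2.0, 0.5, -1.0]
loss = cross_entropy(logits_for_next_token, target_index=0)
```

Because the correct token already has the highest logit, the loss is small; averaging this quantity over all response tokens gives the SFT training objective.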
Stage 2: Reward Model Training
Human labelers are shown multiple model-generated responses to the same prompt and rank them from best to worst. These rankings train a separate reward model — a model that maps any (prompt, response) pair to a scalar score, trained so that responses humans preferred receive higher scores than those they rejected.
Prompt: "Explain recursion to a 10-year-old."
Response A: [technical jargon explanation]
Response B: [story-based analogy] → labeler prefers B
The reward model generalizes from thousands of such comparisons.
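A minimal sketch of how such rankings become a training signal, using the pairwise Bradley-Terry-style loss commonly used for reward models (the scalar scores below are illustrative stand-ins for reward-model outputs):

```python
import math

def pairwise_loss(score_preferred, score_rejected):
    """Bradley-Terry preference loss: -log sigmoid(r_preferred - r_rejected).
    Low when the reward model already scores the preferred response higher."""
    margin = score_preferred - score_rejected
    return math.log(1 + math.exp(-margin))

# Reward model agrees with the labeler -> small loss.
loss_good = pairwise_loss(score_preferred=2.0, score_rejected=-1.0)
# Reward model ranks the pair the wrong way round -> large loss.
loss_bad = pairwise_loss(score_preferred=-1.0, score_rejected=2.0)
```

Minimizing this loss over thousands of labeled comparisons is what teaches the reward model to generalize the human preference signal.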
Stage 3: Policy Optimization with PPO
The SFT model (now called the policy) is updated using Proximal Policy Optimization (PPO), a reinforcement learning algorithm. The policy generates responses, the reward model scores them, and PPO adjusts the policy's weights to maximize expected reward — while a KL-divergence penalty prevents the model from drifting too far from the SFT baseline, which would otherwise degrade fluency and invite reward hacking.
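A minimal sketch of the KL-shaped reward this stage optimizes (one common formulation, not the only one; the log-probabilities below are illustrative):

```python
import math

def shaped_reward(rm_score, logp_policy, logp_sft, kl_coef=0.1):
    """Reward used in RLHF-style PPO: the reward model's score minus a
    KL penalty that keeps the policy close to the SFT baseline."""
    kl_estimate = logp_policy - logp_sft  # per-token log-prob ratio
    return rm_score - kl_coef * kl_estimate

# Same reward-model score, but the second policy assigns its tokens much
# higher probability than the SFT model does (it has drifted), so its
# effective reward is lower.
r_close = shaped_reward(1.0, logp_policy=-0.5, logp_sft=-0.5)
r_drifted = shaped_reward(1.0, logp_policy=-0.5, logp_sft=-2.5)
```

The coefficient `kl_coef` tunes the trade-off: higher values keep the policy conservative, lower values let it chase the reward model harder (and hack it sooner).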
Why RLHF Is Difficult
- Reward hacking — Models learn to game the reward model by producing responses that score well but are not genuinely helpful (e.g., verbose flattery).
- Labeler disagreement — Different humans have different values, leading to noisy preference signals.
- Scalability — Human annotation is expensive and slow; scaling to billions of comparisons is infeasible.
- Mode collapse — PPO can cause the model to converge on a narrow range of response styles.
Alternatives and Successors
| Method | Key idea |
|---|---|
| DPO (Direct Preference Optimization) | Eliminates the explicit reward model; trains directly on preference pairs. Simpler and more stable. |
| RLAIF (RL from AI Feedback) | Uses another LLM (e.g., Claude) as the rater instead of humans, reducing cost. |
| Constitutional AI (CAI) | Anthropic's approach: the model critiques and revises its own outputs using a set of principles. |
| GRPO | Group Relative Policy Optimization — used in DeepSeek-R1; estimates advantages by comparing each response against others in a sampled group, eliminating the separate value model. |
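Of these, DPO has the simplest training objective: it applies a preference loss directly to the policy, with no reward model or RL loop. A minimal sketch of its per-pair loss (Rafailov et al., 2023), using toy summed log-probabilities that are illustrative only:

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * [(pi_w - pi_l) - (ref_w - ref_l)])
    where each argument is a full response's summed log-probability
    under the policy (pi_*) or the frozen reference model (ref_*)."""
    margin = beta * ((pi_w - pi_l) - (ref_w - ref_l))
    return math.log(1 + math.exp(-margin))

# A policy that separates chosen from rejected more than the reference
# does gets a lower loss than one that has the ordering backwards.
loss_improving = dpo_loss(pi_w=-1.0, pi_l=-3.0, ref_w=-2.0, ref_l=-2.0)
loss_regressing = dpo_loss(pi_w=-3.0, pi_l=-1.0, ref_w=-2.0, ref_l=-2.0)
```

The `beta` parameter plays the same role as the KL coefficient in PPO-based RLHF: it controls how far the policy may move from the reference model in pursuit of the preference signal.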
RLHF and Knowledge Quality
RLHF shapes how a model responds — its tone, safety, and helpfulness. But it does not update what the model knows. An RLHF-aligned model will politely and helpfully hallucinate outdated product information just as readily as an unaligned one.
Grounding aligned models with accurate, fresh knowledge — using tools like KnowledgeSDK to extract and index structured web content — is the complementary layer that addresses the knowledge gap RLHF cannot fix.
Key Papers
- "Training language models to follow instructions with human feedback" — Ouyang et al., 2022 (InstructGPT)
- "Learning to summarize from human feedback" — Stiennon et al., 2020
- "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" — Rafailov et al., 2023