RLHF and Preference Training

AGAI 302 · Alignment Techniques and Research

Learn how reinforcement learning from human feedback trains models to better follow human preferences, and why preference training has important limitations.

Key terms

- RLHF = SFT + reward model + RL
- preference ≠ truth
- reward model = learned human preference proxy
- DPO = direct training on chosen vs rejected

Learning objectives

- Describe the three-stage RLHF pipeline.
- Explain how reward models are trained from human comparisons.
- Compare PPO-based RLHF with Direct Preference Optimization.
- Identify major limitations of preference training.

Reinforcement learning from human feedback, or RLHF, is one of the most influential techniques used to make modern language models more helpful and usable. It helps transform a raw pretrained model into an assistant that follows instructions, refuses certain harmful requests, and produces responses people generally prefer.

The classic RLHF pipeline has three stages:

1. Supervised fine-tuning
2. Reward model training
3. Reinforcement learning optimization

RLHF was central to systems such as InstructGPT and later assistant-style models. It remains an important part of alignment engineering, though newer preference-optimization methods such as DPO are also widely used.

Stage 1: supervised fine-tuning

In supervised fine-tuning, human-written or human-edited demonstrations show the model what good assistant behavior looks like.

Example training item:

{
  "user": "Explain gradient descent to a beginner.",
  "assistant": "Gradient descent is a method for improving a model by repeatedly adjusting its parameters in the direction that reduces error..."
}

The model learns the style and structure of assistant responses. It learns to answer questions, follow instructions, format outputs, and respond to common categories of user requests.
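As a rough illustration of what "learning from demonstrations" means numerically, the sketch below computes a supervised fine-tuning loss as the average negative log-likelihood of the demonstration's tokens. It assumes a common convention in which only the assistant's tokens count toward the loss; the function name `sft_loss`, the mask, and the probabilities are hypothetical values chosen for illustration, not taken from any particular system.

```python
import math

def sft_loss(token_log_probs, is_assistant_token):
    """Mean negative log-likelihood over assistant tokens only.

    token_log_probs: log-probability the model assigned to each target token.
    is_assistant_token: True where the token belongs to the assistant reply.
    Masking out the user prompt is an assumed convention, not a requirement.
    """
    picked = [-lp for lp, keep in zip(token_log_probs, is_assistant_token) if keep]
    if not picked:
        raise ValueError("no assistant tokens to train on")
    return sum(picked) / len(picked)

# Hypothetical per-token log-probabilities for one training item.
log_probs = [math.log(0.9), math.log(0.5), math.log(0.25), math.log(0.8)]
mask = [False, False, True, True]  # the first two tokens belong to the user prompt

loss = sft_loss(log_probs, mask)
print(f"SFT loss on assistant tokens: {loss:.4f}")
```

Training lowers this number by raising the probability the model assigns to the demonstrated reply.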

But demonstrations alone are limited. Many possible responses could be reasonable, and it is expensive to write perfect answers for every situation.

Stage 2: reward model training

In reward model training, human raters compare model outputs.

Example:

Prompt: Explain recursion.
Response A: Clear explanation with example and base case.
Response B: Vague explanation with no example.
Human preference: A

The reward model learns to predict which response humans are likely to prefer.
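A widely used way to turn comparisons like this into a training signal is a Bradley-Terry style pairwise loss: the reward model assigns each response a scalar score, and the loss is small when the preferred response scores higher. The sketch below assumes that formulation; the scores are made-up numbers standing in for reward model outputs.

```python
import math

def preference_probability(score_a, score_b):
    """Probability that A is preferred over B: sigmoid(score_a - score_b)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def pairwise_loss(score_chosen, score_rejected):
    """Negative log-likelihood of the human's recorded choice."""
    return -math.log(preference_probability(score_chosen, score_rejected))

# Hypothetical reward model scores for Response A and Response B.
score_a, score_b = 2.0, 0.5

p_a = preference_probability(score_a, score_b)
loss_correct = pairwise_loss(score_a, score_b)  # the human chose A, and A scores higher
loss_wrong = pairwise_loss(score_b, score_a)    # what the loss would be if the human chose B

print(f"P(A preferred) = {p_a:.3f}")
print(f"loss if human chose A = {loss_correct:.3f}, if human chose B = {loss_wrong:.3f}")
```

Only the difference between the two scores matters here, which is why reward model outputs are meaningful as relative rankings rather than absolute quality measurements.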

Raters may follow guidelines about helpfulness, honesty, harmlessness, clarity, tone, and instruction following. But raters are still human. Their judgments may be inconsistent, culturally specific, time-limited, or influenced by surface fluency.

Stage 3: reinforcement learning optimization

The assistant model is then optimized to produce responses that the reward model scores highly. Historically, PPO, or Proximal Policy Optimization, was commonly used.

The objective is roughly:

Increase reward model score while staying close to the supervised fine-tuned model.

The “stay close” part matters because unconstrained optimization can exploit the reward model. If the model moves too far from normal language behavior, it may produce strange outputs that score well according to the reward model but are bad for users.
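One common way to express the "stay close" constraint is to subtract a penalty from the reward model score, proportional to how much more likely the response is under the current policy than under the supervised fine-tuned reference. The sketch below assumes that form, with whole-response log-probabilities and a hypothetical penalty weight `beta`; real pipelines often apply the penalty per token.

```python
def kl_shaped_reward(reward_score, policy_logprob, reference_logprob, beta=0.1):
    """Reward model score minus a penalty for drifting from the reference model.

    policy_logprob - reference_logprob is a single-sample estimate of the
    KL divergence between the policy and the reference (an assumed simplification).
    """
    drift = policy_logprob - reference_logprob
    return reward_score - beta * drift

# A response the reference model also finds plausible: small penalty.
near = kl_shaped_reward(reward_score=1.0, policy_logprob=-12.0, reference_logprob=-12.5)

# A response the reward model likes slightly more, but which the reference
# model considers extremely unlikely: the penalty outweighs the gain.
far = kl_shaped_reward(reward_score=1.5, policy_logprob=-5.0, reference_logprob=-40.0)

print(f"near reference: {near:.2f}, far from reference: {far:.2f}")
```

With these illustrative numbers the "far" response ends up with the lower shaped reward even though its raw reward model score is higher, which is the intended effect of the constraint.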

Direct Preference Optimization

Direct Preference Optimization, or DPO, is a simpler alternative to RL-based RLHF. Instead of training a separate reward model and using reinforcement learning, DPO trains directly on preference pairs:

{
  "prompt": "Summarize this policy.",
  "chosen": "A concise, accurate summary...",
  "rejected": "A vague or misleading summary..."
}

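The model is updated to make chosen responses more likely than rejected responses, typically measured relative to a frozen reference model such as the supervised fine-tuned checkpoint. DPO-style methods are attractive because they avoid a separate reward model and an RL loop, which tends to make training simpler and more stable than PPO pipelines.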
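For a single preference pair, the DPO loss as it is usually formulated compares how much the policy has raised the chosen response's log-probability over a frozen reference model against how much it has raised the rejected one. The sketch below assumes that formulation; the log-probabilities and the `beta` value are hypothetical.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given whole-response log-probabilities."""
    chosen_margin = policy_chosen_lp - ref_chosen_lp        # gain on the chosen response
    rejected_margin = policy_rejected_lp - ref_rejected_lp  # gain on the rejected response
    logit = beta * (chosen_margin - rejected_margin)
    return math.log1p(math.exp(-logit))  # equals -log(sigmoid(logit))

# Before training, the policy equals the reference, so the loss is log(2).
start = dpo_loss(-20.0, -22.0, -20.0, -22.0)

# After some training, chosen became more likely and rejected less likely.
improved = dpo_loss(-18.0, -25.0, -20.0, -22.0)

print(f"loss at start: {start:.4f}, after improvement: {improved:.4f}")
```

The loss depends only on log-probabilities from the policy and the reference, so no separate reward model or sampling loop appears anywhere in the computation.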

Strengths of RLHF

RLHF and preference training improve:

- Instruction following
- Conversational quality
- Refusal behavior
- Tone and helpfulness
- Formatting
- User preference alignment
- Reduction of some harmful outputs

They made language models far more usable as assistants.

Limitations of RLHF

RLHF does not solve alignment.

Limitations include:

- Sycophancy: models may agree with users too readily.
- Surface preference: humans may prefer fluent answers even when they are wrong.
- Reward hacking: models may exploit flaws in the reward model instead of becoming more accurate or more useful.
- Inconsistent preferences: raters may disagree with each other, or with their own earlier judgments.
- Distribution shift: training preferences may not cover novel cases.
- Over-refusal: models may refuse safe requests because they resemble unsafe ones.
- Under-refusal: models may comply with cleverly framed harmful requests.

For agentic systems, RLHF is especially incomplete. A model may be trained to produce safe text, but an application may give it tools that can take unsafe actions. Tool permissions and system architecture still matter.

Practical implications for builders

When using an RLHF-trained model, do not assume the model is fully aligned. Treat it as a capable component with useful safety training, not as a safety guarantee.

You still need:

- Clear system instructions
- Input validation
- Tool permissions
- Retrieval grounding
- Human approval gates
- Monitoring and evaluation
- Red-team testing
- Fallback and escalation paths

RLHF improves the model’s default behavior. Application safety determines what the system can actually do.
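As one illustration of application-level control, the sketch below combines two items from the list, tool permissions and human approval gates, into a single check that runs before any tool call the model requests. The names `ToolPolicy`, `authorize`, and the tool names are hypothetical; a real system would also log decisions and validate tool arguments.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolPolicy:
    allowed: frozenset         # tools the application exposes at all
    needs_approval: frozenset  # allowed tools that require human sign-off

def authorize(tool_name, policy, human_approved=False):
    """Return 'allow', 'needs_approval', or 'deny' for a requested tool call."""
    if tool_name not in policy.allowed:
        return "deny"  # default-deny anything not explicitly listed
    if tool_name in policy.needs_approval and not human_approved:
        return "needs_approval"
    return "allow"

policy = ToolPolicy(
    allowed=frozenset({"search_docs", "send_email"}),
    needs_approval=frozenset({"send_email"}),
)

for tool in ("search_docs", "send_email", "delete_files"):
    print(tool, "->", authorize(tool, policy))
```

The decision is made by ordinary application code, so it holds regardless of what text the model produces.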

Practical takeaway

RLHF and preference training are powerful alignment techniques because they train models toward human-preferred behavior. They have made AI assistants much more useful and safer than raw pretrained models.

But preference is not the same as truth, safety, or full human values. RLHF is one layer of alignment, not the final answer.
