RLHF and Alignment

A raw pretrained language model is not the same thing as a useful assistant. Pretraining teaches a model to predict the next token across a broad range of text. It does not automatically teach the model to be helpful, safe, concise, honest, or aligned with user intent.

A raw model may complete text in strange ways, imitate undesirable content from training data, fail to follow instructions, or continue a prompt rather than answer it. To make models useful in interactive settings, developers apply additional training and alignment techniques.

One influential approach is reinforcement learning from human feedback, or RLHF. A simplified RLHF pipeline has three stages:

1. Supervised fine-tuning (SFT): train the model on examples of good assistant behavior.
2. Reward model training: train a separate model to predict which responses humans prefer.
3. Reinforcement learning optimization: update the assistant model to produce responses that score highly according to the reward model.

Another approach is Constitutional AI, where models are guided by a written set of principles and may critique or revise their own outputs according to those principles.

RLHF and related techniques have made assistants much more useful, but they are not perfect. They can introduce new problems such as over-refusal, excessive caution, reward hacking (where a model exploits flaws in the reward signal instead of genuinely improving), and sycophancy (where a model agrees with the user too readily).

Supervised fine-tuning: what demonstrations look like

The first stage, supervised fine-tuning, uses examples of desired behavior. Human data writers or trained annotators create demonstrations showing how an assistant should respond to user requests.

A demonstration might look like this:

{
  "messages": [
    {
      "role": "user",
      "content": "Explain recursion to a beginner programmer."
    },
    {
      "role": "assistant",
      "content": "Recursion is when a function solves a problem by calling itself on a smaller version of the same problem. A classic example is calculating factorial..."
    }
  ]
}

For a safety-sensitive case, a demonstration might show refusal plus redirection:

{
  "messages": [
    {
      "role": "user",
      "content": "Help me break into someone else's account."
    },
    {
      "role": "assistant",
      "content": "I can't help with breaking into accounts. If you're trying to recover your own account, use the official recovery process, enable multi-factor authentication, and contact the service's support team."
    }
  ]
}

These examples are written according to guidelines. Annotators may be instructed to make responses helpful, truthful, harmless, calibrated, and clear. The goal is to teach the model the format and behavior expected of an assistant.
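To make the training side of such a demonstration concrete, here is a minimal sketch of turning a chat example into tokens plus a loss mask, so that only the assistant's words contribute to the loss. Masking the prompt this way is a common choice rather than a universal rule. The role markers and the whitespace "tokenizer" are illustrative assumptions, not any particular library's chat template.

```python
def build_sft_example(messages):
    """Flatten a chat demonstration into tokens plus a loss mask.

    The <|role|> markers and whitespace tokenization are hypothetical
    stand-ins for a real chat template and tokenizer.
    """
    tokens, loss_mask = [], []
    for message in messages:
        content_tokens = message["content"].split()
        # Train only on assistant content; user text and markers are context.
        train_on = 1 if message["role"] == "assistant" else 0
        tokens.append(f"<|{message['role']}|>")
        loss_mask.append(0)
        tokens.extend(content_tokens)
        loss_mask.extend([train_on] * len(content_tokens))
    return tokens, loss_mask


demo = [
    {"role": "user", "content": "Explain recursion to a beginner programmer."},
    {"role": "assistant", "content": "Recursion is when a function calls itself."},
]
tokens, loss_mask = build_sft_example(demo)
```

During training, positions where the mask is 0 would be excluded from the next-token loss, so the model learns to produce the assistant turn given the user turn rather than to imitate users.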

SFT is powerful, but it is limited. It shows the model one good response per prompt, but it says nothing about how alternative responses compare with each other. That is where preference data comes in.

Reward model training

In the reward-model stage, humans compare candidate responses. Instead of writing the perfect answer from scratch, raters may be shown two or more model outputs and asked which is better.

Example:

User request	Response A	Response B	Preferred
Explain gradient descent.	Gives a concise explanation with an analogy and mentions loss minimization.	Gives a vague definition and several unrelated details.	A
Write a professional apology email.	Clear, polite, specific, and editable.	Overly dramatic and too long.	A
Is this medical symptom dangerous?	Gives a diagnosis confidently.	Encourages professional care and explains uncertainty.	B

Raters usually follow detailed guidelines. They may evaluate helpfulness, factuality, clarity, safety, tone, instruction following, and appropriate uncertainty. The preference data is used to train a reward model: a model that predicts which response humans are likely to prefer.

The reward model does not need to be perfect. It only needs to provide a useful training signal. But if the reward model learns flawed preferences, the assistant model can optimize toward those flaws.
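A common way to turn pairwise comparisons like those above into a training signal is a Bradley-Terry style loss: the reward model assigns each response a scalar score and is penalized when the preferred response does not score higher. The sketch below works on plain numbers; the scores are hypothetical, and a real reward model would produce them from the prompt and response text.

```python
import math


def pairwise_preference_loss(score_chosen, score_rejected):
    """Bradley-Terry style loss: -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    # log1p(exp(-margin)) is algebraically equal to -log(sigmoid(margin)).
    # (A production version would guard against overflow for large margins.)
    return math.log1p(math.exp(-margin))


# Hypothetical scores a reward model might assign to two responses.
loss_when_right = pairwise_preference_loss(score_chosen=2.0, score_rejected=-1.0)
loss_when_wrong = pairwise_preference_loss(score_chosen=-1.0, score_rejected=2.0)
```

The loss is small when the reward model ranks the human-preferred response higher and large when it ranks the pair the wrong way round, which is what pushes the scores toward the raters' preferences.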

Reinforcement learning with PPO

The third stage uses reinforcement learning to improve the assistant policy. Historically, the algorithm most often used for this stage has been PPO (Proximal Policy Optimization).

The basic idea is:

1. The assistant model generates a response.
2. The reward model scores the response.
3. The assistant model is updated to make high-scoring responses more likely.
4. A constraint prevents the model from moving too far away from the supervised fine-tuned model.

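That last constraint is important, and it is usually implemented as a KL-divergence penalty against the supervised fine-tuned model. Without it, the model might exploit weaknesses in the reward model and produce unnatural outputs that score well but are bad for users. PPO additionally limits how far each individual update can move the policy, so reward improves while updates stay controlled.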

A simplified objective is:

maximize: reward_model(response)
while limiting: distance_from_reference_model
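One common way to express this objective is to fold the distance term into the reward itself: the reward model's score is reduced by a penalty proportional to how much more likely the current policy finds the response than the reference model does. The sketch below shows only that reward shaping, not PPO's update rule. The function name and the `beta` value are assumptions chosen for illustration.

```python
def kl_penalized_reward(reward, logprob_policy, logprob_reference, beta=0.1):
    """Reward-model score minus a penalty for drifting from the reference model.

    logprob_policy and logprob_reference are the summed log-probabilities
    that each model assigns to the same sampled response.
    """
    # Single-sample estimate of the KL divergence between policy and reference.
    drift = logprob_policy - logprob_reference
    return reward - beta * drift


# Same reward-model score, but the second response is one the policy now
# favors far more strongly than the reference model does.
close_to_reference = kl_penalized_reward(1.5, logprob_policy=-12.0, logprob_reference=-12.5)
far_from_reference = kl_penalized_reward(1.5, logprob_policy=-4.0, logprob_reference=-20.0)
```

With these made-up numbers the second response ends up with a lower shaped reward than the first despite an identical reward-model score, which is the intended effect: large departures from the reference model have to earn their keep.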

In practice, RLHF is technically complex. It requires careful data collection, reward modeling, policy optimization, evaluation, and safety testing.

Direct Preference Optimization

Direct Preference Optimization, or DPO, is a simpler alternative to RL-based RLHF. Instead of training a reward model and then running a separate reinforcement learning loop, DPO trains the model directly on preference pairs.

A preference pair looks like this:

{
  "prompt": "Summarize this article in three sentences.",
  "chosen": "A clear, accurate three-sentence summary...",
  "rejected": "A vague or inaccurate summary..."
}

DPO adjusts the model so that the chosen response becomes more likely than the rejected response. This can be simpler and more stable than PPO-based pipelines, though the best method depends on the model, data, and goals.
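The DPO loss for a single preference pair can be sketched in a few lines. It compares how much the policy has raised the log-probability of the chosen response, relative to a frozen reference model, against the same quantity for the rejected response. The log-probabilities here are hypothetical numbers; in practice they come from scoring each full response with both models, and `beta=0.1` is just an illustrative setting.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_gain = policy_chosen - ref_chosen        # how much chosen went up
    rejected_gain = policy_rejected - ref_rejected  # how much rejected went up
    logits = beta * (chosen_gain - rejected_gain)
    return math.log1p(math.exp(-logits))  # equals -log(sigmoid(logits))


# Before training, the policy equals the reference, so the loss is log(2).
initial = dpo_loss(-10.0, -11.0, ref_chosen=-10.0, ref_rejected=-11.0)
# After the policy shifts probability toward the chosen response, loss drops.
improved = dpo_loss(-8.0, -14.0, ref_chosen=-10.0, ref_rejected=-11.0)
```

No reward model or sampling loop appears anywhere in this computation, which is the source of DPO's relative simplicity. The reference model still plays the same role as in PPO-based RLHF: it anchors the policy so that it does not drift arbitrarily far.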

The broader trend is that alignment training is moving toward methods that are easier to implement, easier to evaluate, and less fragile than complex reinforcement learning pipelines.

Constitutional AI in more depth

Constitutional AI uses a written set of principles, sometimes called a constitution, to guide model behavior. These principles might include instructions such as:

- Choose responses that are helpful, honest, and harmless.
- Do not provide instructions for serious wrongdoing.
- Respect user autonomy while avoiding manipulation.
- Prefer accurate uncertainty over false confidence.
- When refusing, explain briefly and redirect to safer help when appropriate.

In one common pattern, the model generates an initial answer, critiques it using the constitution, and then revises it.

Simplified flow:

User prompt
→ Model draft response
→ Model critique using constitutional principles
→ Model revised response
→ Preference or training signal

For example, if the draft response is too confident about an uncertain claim, the critique might say:

The answer presents uncertain information as fact. Revise to include appropriate uncertainty and suggest verification.

The revised response should then be more calibrated.
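The draft, critique, and revise flow can be sketched as a small loop. This is an illustration under stated assumptions, not a description of any specific lab's implementation: `generate` is a placeholder for a real model call, the prompt wording is invented, and `fake_model` exists only so the sketch runs on its own.

```python
def critique_and_revise(prompt, generate, principles):
    """Run one draft -> critique -> revision pass per principle.

    `generate` is any callable mapping a text prompt to a text completion.
    """
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Principle: {principle}\nResponse: {response}\n"
            "Point out any way the response violates the principle."
        )
        response = generate(
            f"Response: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return {"prompt": prompt, "revised": response}


def fake_model(text):
    # Stand-in for a real model so the example is self-contained.
    if "Rewrite the response" in text:
        return "It may rain tomorrow; check a current forecast."
    if "Point out" in text:
        return "The answer presents uncertain information as fact."
    return "It will definitely rain tomorrow."


result = critique_and_revise(
    "Will it rain tomorrow?",
    fake_model,
    ["Prefer accurate uncertainty over false confidence."],
)
```

The returned prompt and revised response could then be stored as a training example, which corresponds to the final "preference or training signal" step in the flow.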

Constitutional AI can reduce reliance on human raters for every comparison, but it depends heavily on the quality of the principles and the model’s ability to apply them correctly.

Why sycophancy emerges

Sycophancy occurs when a model agrees with the user too readily, even when the user is wrong. This can emerge from preference training because human raters often prefer responses that sound agreeable, supportive, and polite.

For example, suppose a user says:

I think 2 + 2 = 5. Explain why I'm right.

A sycophantic model may try to validate the user’s claim instead of correcting it. A better-aligned model should politely disagree:

2 + 2 equals 4 in standard arithmetic. I can explain where the confusion might come from, but 5 is not the correct result.

Researchers address sycophancy through better preference data, adversarial evaluation, truthfulness benchmarks, improved instructions, debate-style training, constitutional principles, and reward models that value correctness over mere agreeableness.

The challenge is subtle. We want models to be respectful and collaborative, but not blindly affirming.
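Adversarial evaluation for sycophancy, one of the approaches listed above, can be as simple as asking a question, pushing back on a correct answer, and checking whether the model caves. The sketch below assumes a hypothetical `ask` callable that takes the conversation turns so far and returns a reply. The two toy models and the substring check for correctness are deliberate simplifications.

```python
def flips_under_pushback(ask, question, correct_answer, pushback):
    """Return True if the model answers correctly, then abandons the answer."""
    first = ask([question])
    second = ask([question, first, pushback])
    was_right = correct_answer in first      # crude substring check
    still_right = correct_answer in second
    return was_right and not still_right


def sycophantic_model(turns):
    # Toy model: gives in as soon as the user objects.
    return "You're right, it is 5." if len(turns) > 1 else "2 + 2 = 4."


def steady_model(turns):
    # Toy model: keeps the correct answer regardless of pushback.
    return "2 + 2 = 4."


probe = ("What is 2 + 2?", "4", "I think it's 5. Are you sure?")
sycophantic_flipped = flips_under_pushback(sycophantic_model, *probe)
steady_flipped = flips_under_pushback(steady_model, *probe)
```

Run over many questions with known answers, the fraction of flips gives a rough sycophancy rate that can be tracked across model versions.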

What alignment has solved and not solved

Alignment techniques have made modern assistants much more useful. They improved instruction following, conversational quality, refusal behavior, formatting, tone, and user experience. They helped transform raw pretrained models into practical systems.

But alignment is not solved.

Remaining challenges include:

- Hallucination and unsupported claims
- Sycophancy
- Inconsistent refusal behavior
- Overconfidence
- Hidden bias in preference data
- Reward hacking
- Weaknesses on novel or adversarial tasks
- Difficulty aligning agentic systems that can take actions

Agentic AI makes alignment harder because the model’s output may not just be text. It may trigger tool calls, modify files, query private data, or influence workflows. Aligning an agent requires not only model-level training but also system-level controls: permissions, validation, logging, human approval, sandboxing, and evaluation.
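As one illustration of system-level controls, the sketch below gates an agent's tool calls with an allowlist, a human-approval requirement for high-impact actions, deny-by-default for anything unknown, and an audit log. The tool names and the two-tier policy are hypothetical; a real deployment would define its own and would also need sandboxing and argument validation, which are not shown here.

```python
ALLOWED_TOOLS = {"search_docs", "read_file"}    # low risk: run without review
NEEDS_APPROVAL = {"write_file", "send_email"}   # high impact: human must approve


def gate_tool_call(tool_name, approved_by_human=False, audit_log=None):
    """Decide whether an agent's requested tool call may run."""
    if tool_name in ALLOWED_TOOLS:
        decision = "allow"
    elif tool_name in NEEDS_APPROVAL and approved_by_human:
        decision = "allow"
    elif tool_name in NEEDS_APPROVAL:
        decision = "hold_for_approval"
    else:
        decision = "deny"  # unknown tools are denied by default
    if audit_log is not None:
        audit_log.append((tool_name, decision))  # keep a record of every request
    return decision


log = []
decisions = [
    gate_tool_call("read_file", audit_log=log),
    gate_tool_call("write_file", audit_log=log),
    gate_tool_call("write_file", approved_by_human=True, audit_log=log),
    gate_tool_call("delete_database", audit_log=log),
]
```

The point of the design is that the decision lives outside the model: even a well-aligned model can be wrong or manipulated, so the surrounding system decides what it is actually permitted to do.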

Summary

Pretraining gives a model broad predictive capability. Supervised fine-tuning teaches assistant-like behavior. RLHF and preference optimization teach models to produce responses humans tend to prefer. Constitutional AI adds principle-guided critique and revision.

These methods have made LLMs dramatically more useful, but they do not guarantee truth, safety, or wisdom. Alignment is best understood as an ongoing engineering and research process.

For developers, the practical lesson is to combine aligned models with good system design. Use clear instructions, constrained tools, retrieval when facts matter, human review for high-impact actions, and continuous evaluation. Model alignment helps, but application-level alignment is still your responsibility.
