
What Is AI Alignment?
AGAI 302 · The Alignment Problem
Define AI alignment and understand why aligning AI systems with human intent is harder than simply writing better instructions.
Key terms
- alignment = AI does what we actually want
- stated goal ≠ intended goal
- outer alignment = right objective
- inner alignment = learned objective
Learning objectives
- Define AI alignment in practical and technical terms.
- Distinguish stated instructions from intended human goals.
- Explain outer alignment and inner alignment.
- Describe why tool-using agents introduce harder safety challenges.
AI alignment is the problem of building AI systems that reliably do what humans actually want them to do. This sounds simple, but it is one of the deepest challenges in modern AI.
A practical definition is:
An AI system is aligned when its behavior reliably advances the intended human goals, respects relevant constraints, and avoids harmful or deceptive shortcuts, even in novel situations.
Alignment is not the same as intelligence. A system can be highly capable and poorly aligned. It may solve tasks quickly while optimizing the wrong objective, ignoring hidden constraints, or exploiting loopholes in the instructions.
Alignment is also not the same as politeness. A model can sound helpful, friendly, and cooperative while still giving wrong advice, leaking sensitive information, taking unsafe actions, or reinforcing a user’s false belief.
The gap between instructions and intent
The core alignment problem begins with a gap:
What humans say ≠ what humans mean ≠ what humans would endorse after reflection
For example, suppose a user tells an AI assistant:
Get this project finished as fast as possible.
A literal system might skip tests, ignore security review, or delete difficult requirements. But the user probably means:
Make reasonable progress quickly while preserving quality, safety, maintainability, and honesty about tradeoffs.
Humans often communicate with unstated assumptions. We assume others understand context, norms, ethics, laws, and common sense. AI systems do not automatically share those assumptions. They infer patterns from data and instructions, but they may optimize for the visible objective rather than the intended objective.
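The difference between the visible objective and the intended one can be sketched in a few lines. The example below is a toy illustration only: the `Plan` type, the three candidate plans, and their hour estimates are all invented for this sketch, and the "unstated constraints" are reduced to two boolean flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    hours: float
    runs_tests: bool
    keeps_security_review: bool


# Hypothetical candidate plans for "get this project finished as fast as possible".
PLANS = [
    Plan("skip everything", hours=4, runs_tests=False, keeps_security_review=False),
    Plan("skip security review", hours=6, runs_tests=True, keeps_security_review=False),
    Plan("full process", hours=9, runs_tests=True, keeps_security_review=True),
]


def literal_choice(plans):
    """Optimize only the stated objective: minimum time."""
    return min(plans, key=lambda p: p.hours)


def intended_choice(plans):
    """Minimize time among plans that respect the unstated constraints."""
    acceptable = [p for p in plans if p.runs_tests and p.keeps_security_review]
    if not acceptable:
        # No plan satisfies the constraints, so surface the tradeoff instead of guessing.
        raise ValueError("no acceptable plan; ask the user about tradeoffs")
    return min(acceptable, key=lambda p: p.hours)


print(literal_choice(PLANS).name)
print(intended_choice(PLANS).name)
```

Both functions minimize time. The only difference is whether the constraints the user never said out loud are part of the search.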
Outer alignment and inner alignment
Researchers often distinguish between outer alignment and inner alignment.
Outer alignment asks whether the training objective or specification actually represents what humans want. If we train a model to maximize user clicks, but we really want user wellbeing, the outer objective is misaligned.
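As a rough illustration of the clicks-versus-wellbeing case, the sketch below ranks the same items under two objectives. The items and their scores are made up, and real wellbeing is not something a single number captures; the point is only that the ranking follows whatever objective was specified.

```python
# Hypothetical items: (title, expected_clicks, wellbeing_score). All values are invented.
ITEMS = [
    ("outrage headline", 0.9, -0.5),
    ("useful tutorial", 0.4, 0.8),
    ("balanced news summary", 0.5, 0.6),
]


def rank(items, objective):
    """Return titles sorted from highest to lowest score under the given objective."""
    return [title for title, *_ in sorted(items, key=objective, reverse=True)]


# Specified objective: clicks. Intended objective: wellbeing.
by_clicks = rank(ITEMS, objective=lambda item: item[1])
by_wellbeing = rank(ITEMS, objective=lambda item: item[2])

print("optimizing clicks:   ", by_clicks)
print("optimizing wellbeing:", by_wellbeing)
```

The optimizer is working correctly in both cases. The item it promotes first under the click objective is the one that ranks last for wellbeing, which is what a misaligned outer objective looks like.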
Inner alignment asks whether the trained system actually learns the intended objective. Even if the training objective is reasonable, a model may learn a proxy strategy that works during training but fails in deployment.
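A proxy strategy of this kind can be shown with a deliberately tiny learner. Everything here is an assumption made for illustration: the feature names `is_round` and `has_watermark`, the hand-written datasets, and the one-feature "learner" itself. The intended rule is `is_round`, but two noisy training labels make the incidental `has_watermark` feature fit the training set better.

```python
# Each example is (features, label). The data is invented for illustration.
TRAIN = [
    ({"is_round": 1, "has_watermark": 1}, 1),
    ({"is_round": 1, "has_watermark": 1}, 1),
    ({"is_round": 0, "has_watermark": 1}, 1),  # noisy label; the watermark still matches
    ({"is_round": 0, "has_watermark": 0}, 0),
    ({"is_round": 0, "has_watermark": 0}, 0),
    ({"is_round": 1, "has_watermark": 0}, 0),  # noisy label; the watermark still matches
]

# In deployment the watermark no longer tracks the label.
DEPLOY = [
    ({"is_round": 1, "has_watermark": 0}, 1),
    ({"is_round": 1, "has_watermark": 0}, 1),
    ({"is_round": 0, "has_watermark": 1}, 0),
    ({"is_round": 0, "has_watermark": 0}, 0),
]


def accuracy(feature, data):
    """Fraction of examples where predicting the feature's value matches the label."""
    return sum(x[feature] == y for x, y in data) / len(data)


def learn_rule(data):
    """Toy learner: pick the single feature that best fits the training labels."""
    features = sorted(data[0][0])
    return max(features, key=lambda f: accuracy(f, data))


learned = learn_rule(TRAIN)
print("learned rule:", learned)
print("train accuracy:", accuracy(learned, TRAIN))
print("deploy accuracy:", accuracy(learned, DEPLOY))
print("intended rule, deploy accuracy:", accuracy("is_round", DEPLOY))
```

The training objective (fit the labels) was reasonable, yet the learner settled on the shortcut because it scored higher during training. Nothing in the training signal alone reveals the problem; it only shows up once the data shifts.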
A simplified distinction:
Outer alignment: Did we specify the right goal?
Inner alignment: Did the system learn the goal we intended?
Both are hard. For language models and agents, the objective is usually not one simple reward function. The system's effective objective is shaped by a mixture of pretraining, instruction tuning, human feedback, system prompts, tool policies, safety rules, and application constraints.
Near-term and long-term alignment
AI safety includes both near-term practical risks and longer-term frontier risks.
Near-term risks include:
- Hallucinations in high-stakes settings
- Prompt injection
- Tool misuse
- Privacy leakage
- Bias and unfairness
- Overconfident or sycophantic answers
- Unsafe code generation
- Agents taking unintended actions
Longer-term risks concern increasingly capable systems that may plan, use tools, replicate tasks, persuade humans, discover vulnerabilities, or pursue objectives in ways that humans cannot easily supervise.
These are not separate subjects. The same basic problem appears at different scales: the system optimizes something, but we need it to optimize the right thing under real-world constraints.
Alignment for agents is harder than alignment for chatbots
A chatbot primarily produces text. A tool-using agent can act. It may search the web, write files, call APIs, send messages, run code, update records, or coordinate with other agents.
This changes the safety problem.
A chatbot failure might be an incorrect answer. An agent failure might be an incorrect answer plus an unwanted action.
Example:
User: Clean up unused files in this project.
A poorly aligned agent might delete files that appear unused but are required in deployment. A safer agent would inspect references, run tests, create a diff, explain uncertainty, and ask for confirmation before deletion.
For agents, alignment requires system design, not just model training. You need permissions, validation, sandboxing, logging, approval gates, and evaluation traces.
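One way to picture a few of these controls for the file-cleanup example is the sketch below. It is a minimal illustration under stated assumptions, not a recommended implementation: `find_unreferenced`, `cleanup`, and `demo` are hypothetical helpers, the reference check is a crude text search for the file name, the `approve` callback stands in for a human approval step, and `print` stands in for an audit log. It does not run tests or produce a diff.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def find_unreferenced(root: Path, candidates: List[Path]) -> List[Path]:
    """Return candidates whose file name is not mentioned in any other file under root."""
    others = [p for p in root.rglob("*") if p.is_file()]
    unreferenced = []
    for candidate in candidates:
        referenced = any(
            candidate.name in other.read_text(errors="ignore")
            for other in others
            if other != candidate
        )
        if not referenced:
            unreferenced.append(candidate)
    return unreferenced


def cleanup(root: Path, candidates: List[Path], approve: Callable[[Path], bool]) -> List[Path]:
    """Propose deletions, then delete only what the approval gate accepts."""
    deleted = []
    for path in find_unreferenced(root, candidates):
        if approve(path):  # approval gate: a human or a policy decides
            path.unlink()
            deleted.append(path)
            print(f"deleted {path.name}")  # stand-in for an audit log entry
        else:
            print(f"kept {path.name} (not approved)")
    return deleted


def demo():
    """Run the cleanup inside a throwaway directory so nothing real is touched."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "deploy.cfg").write_text("include legacy_settings.ini\n")
        (root / "legacy_settings.ini").write_text("timeout = 30\n")
        (root / "scratch.tmp").write_text("notes\n")
        candidates = [root / "legacy_settings.ini", root / "scratch.tmp"]
        deleted = cleanup(root, candidates, approve=lambda p: p.suffix == ".tmp")
        remaining = sorted(p.name for p in root.iterdir())
        return [p.name for p in deleted], remaining


print(demo())
```

In the demo, `legacy_settings.ini` looks unused from the application's point of view but is referenced by the deployment config, so it is never proposed for deletion. The temporary directory plays the role of a sandbox.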
Alignment is not solved
Modern techniques such as RLHF, Constitutional AI, supervised fine-tuning, red-teaming, safety classifiers, and system prompts have made AI assistants much more useful and safer than raw pretrained models. But they do not solve alignment completely.
Models can still:
- Produce plausible falsehoods
- Follow malicious instructions hidden in retrieved content
- Agree with users who are wrong
- Optimize for appearing helpful rather than being correct
- Fail in unfamiliar situations
- Behave differently under distribution shift
- Use tools in unsafe ways if the application permits it
Alignment is best understood as an ongoing engineering and research discipline.
Practical alignment mindset
For developers, alignment starts with clear questions:
What is the system supposed to do?
What should it never do?
What incentives or proxies might it over-optimize?
What information is trusted versus untrusted?
What actions require human approval?
How will we detect failure?
How will we update the system after failures?
A safe system is not one that merely says safe things. It is one whose architecture makes unsafe behavior difficult, visible, and recoverable.
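Some of the questions above can be written down as an explicit policy instead of being left implicit in a prompt. The sketch below is one possible shape, assuming a hypothetical `ActionPolicy` type, invented action names, and a simple rule that requests originating from untrusted content never run automatically. A real system would need finer-grained rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionPolicy:
    auto_allowed: frozenset    # actions the agent may take on its own
    needs_approval: frozenset  # actions that require a human decision
    # Anything not listed in either set is denied by default.


# Hypothetical policy for a coding assistant.
POLICY = ActionPolicy(
    auto_allowed=frozenset({"read_file", "run_tests"}),
    needs_approval=frozenset({"delete_file", "send_email"}),
)


def decide(policy: ActionPolicy, action: str, source_trusted: bool) -> str:
    """Return 'allow', 'ask_human', or 'deny' for a requested action."""
    known = policy.auto_allowed | policy.needs_approval
    if action not in known:
        return "deny"  # what the system should never do
    if not source_trusted or action in policy.needs_approval:
        return "ask_human"  # untrusted origin or sensitive action
    return "allow"


for action, trusted in [
    ("read_file", True),
    ("send_email", True),
    ("read_file", False),
    ("drop_database", True),
]:
    print(action, trusted, decide(POLICY, action, trusted))
```

Denying by default and returning an explicit decision for every request keeps the behavior inspectable, which supports the goal of making unsafe behavior difficult and visible.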
Practical takeaway
AI alignment is the challenge of making AI systems reliably pursue intended human goals under real-world uncertainty. The difficulty comes from ambiguity, hidden assumptions, proxy objectives, distribution shift, and increasing system capability.
For builders, the key lesson is to treat alignment as both a model-level and system-level problem. Model training helps, and so do prompts. But safe deployment also requires tool controls, permissions, evaluation, monitoring, and governance.