Introduction
Stuart Russell's formulation of the alignment problem, developed most fully in his 2019 book 'Human Compatible,' is both a diagnosis and a proposed solution. The diagnosis: AI systems that optimize a fixed objective become unsafe as they grow more capable, because the objective never fully captures what people actually want. The solution: build AI systems that are uncertain about their objectives and defer to humans.
The Setup
Consider the standard model of AI: you specify an objective, the AI optimizes for it. The problem is that any fixed objective you can specify is, at best, a proxy for what you actually want. AI systems that optimize a proxy objective with superhuman capability will find ways to satisfy the proxy that violate the intent behind it. The only fix, Russell argues, is to build systems that treat human preferences as something to be learned, not something that is already known.
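The gap between a proxy and the intent behind it can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the functions `proxy` and `true_utility`, the quadratic penalty, and the two search ranges are hypothetical and do not come from Russell's book. The point is only that a more capable optimizer, searching a wider range, pushes the proxy higher while the intended outcome gets worse.

```python
def proxy(x: float) -> float:
    """Hypothetical specified objective: more x is always better."""
    return x


def true_utility(x: float) -> float:
    """Hypothetical intended objective: x helps up to a point, then harms."""
    return x - 0.1 * x ** 2


def optimize(objective, candidates):
    """Return the candidate that scores highest on the given objective."""
    return max(candidates, key=objective)


# A weak optimizer can only search a small range of actions (0.0 to 5.0).
weak_candidates = [i * 0.5 for i in range(11)]
# A strong optimizer can search a much wider range (0.0 to 20.0).
strong_candidates = [i * 0.5 for i in range(41)]

x_weak = optimize(proxy, weak_candidates)
x_strong = optimize(proxy, strong_candidates)

print("weak:  ", x_weak, proxy(x_weak), true_utility(x_weak))
print("strong:", x_strong, proxy(x_strong), true_utility(x_strong))
```

In this sketch both optimizers maximize the same proxy. Within the small range, proxy and intent agree, so the weak optimizer does well on both. The strong optimizer scores higher on the proxy and lower on the intended objective, which is the failure pattern the paragraph above describes.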
The Paradox or Question
The central question is whether it is possible to specify human values well enough to serve as a fixed AI objective. Russell argues the answer is no — not because we cannot write down many values, but because human values are too complex, contextual, and internally inconsistent to be captured in any fixed specification.
How It Changed AI
Russell proposes a new foundation for AI: cooperative AI systems that are uncertain about human values and take actions to learn them. Such systems would be deferential to humans by default — not because they are programmed to be, but because uncertainty about their objective makes deference the rational strategy. This framework has influenced thinking about AI alignment, though its full implementation remains a research challenge.
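Why uncertainty makes deference rational can be sketched with a small expected-value calculation, loosely in the spirit of the off-switch argument that Russell and his collaborators have discussed. The setup is a simplification with assumptions stated up front: the robot's belief is a list of equally likely utility values for a proposed action, switching off is worth 0, and the human is assumed to know the true utility and to permit the action only when it is positive. The function names are hypothetical.

```python
from statistics import mean


def act_value(belief):
    """Expected utility if the robot acts without asking."""
    return mean(belief)


def off_value(belief):
    """Utility if the robot switches itself off (assumed to be 0)."""
    return 0.0


def defer_value(belief):
    """Expected utility if the robot defers to the human.

    Assumption: the human knows the true utility u, allows the action
    when u > 0, and switches the robot off (utility 0) otherwise.
    """
    return mean([max(u, 0.0) for u in belief])


# The robot is unsure whether the action is good or bad for the human.
uncertain = [-4.0, -1.0, 2.0, 5.0]
# The robot is certain the action is worth exactly 2.0.
certain = [2.0, 2.0, 2.0]

for name, belief in [("uncertain", uncertain), ("certain", certain)]:
    print(name, act_value(belief), off_value(belief), defer_value(belief))
```

Because max(u, 0) is never smaller than u or 0, deferring is never worse than acting or switching off under these assumptions, and it is strictly better when the belief puts weight on both good and bad outcomes. When the robot is certain, deferring gains nothing, which is why the incentive to defer depends on uncertainty. The result weakens if the human can make mistakes, a case this sketch does not model.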
Historical Context
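Russell's formulation built on decades of work in reinforcement learning and decision theory, including inverse reinforcement learning, in which an agent infers the objective behind observed behavior rather than being handed one. His 2019 book brought these ideas to a broad audience and has been widely read by both AI researchers and policymakers.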
Relevance Today
Russell's framing of the alignment problem is echoed in approaches such as RLHF (reinforcement learning from human feedback) and Constitutional AI, which try to learn desired behavior from feedback or from written principles rather than from a single hand-specified objective. His insight that AI systems should be uncertain about their objectives and defer to humans remains an influential idea in alignment research.
