
AI Safety & Alignment
AGAI 302
Examine the core challenges of building AI systems that are safe, reliable, and aligned with human values. From prompt injection to reward hacking to long-term existential risk, develop a rigorous framework for thinking about AI safety.
Why Safety Is a Technical Problem
AI safety is often treated as a philosophical concern. In practice, it is a deeply technical one. How do you specify what you want an AI system to do? How do you ensure it does not find unintended shortcuts? How do you maintain meaningful human oversight as systems become more capable?
The Alignment Problem
The alignment problem is simple to state and extremely hard to solve: how do you build an AI system that reliably does what you actually want, not just what you said? This course explores the many ways alignment can fail — from specification gaming to reward hacking to deceptive alignment — and surveys the research approaches aimed at solving these problems.
What You Will Learn
You will build a rigorous framework for thinking about AI safety — not as an abstract philosophical concern but as a set of concrete engineering and governance challenges. You will study RLHF, Constitutional AI, interpretability research, and practical red-teaming approaches. You will learn to identify prompt injection and adversarial inputs, think through instrumental convergence, and apply safety thinking to the design of real AI systems.
Who This Course Is For
This course is for engineers, researchers, and product leaders who want to build AI systems responsibly and understand the technical landscape of AI safety. It is appropriate for practitioners who have built real AI systems and want to understand how and why they can fail — and for anyone who wants to engage seriously with the technical dimensions of AI risk rather than relying on vague intuitions.
What you will learn
- Define AI alignment and explain why it is technically difficult
- Describe the main ways AI systems can fail to align with human values
- Explain RLHF and its role in aligning language models
- Identify prompt injection and other adversarial risks
- Summarize key interpretability research approaches
- Apply safety thinking to real AI system design
Major topics
Why this course matters
As AI systems become more capable and more autonomous, the consequences of misalignment grow. Understanding alignment — not as an abstract concern but as a concrete engineering challenge — is essential for anyone building or deploying AI systems.
Course modules
The Alignment Problem
Understand what AI alignment means and why it is technically difficult. This module introduces the gap between stated objectives and human intent, then explores reward hacking, specification gaming, and long-term concerns such as instrumental convergence.
Alignment Techniques and Research
Survey the major technical approaches used to make AI systems more helpful, honest, and safe. This module covers RLHF, Constitutional AI, debate, scalable oversight, and interpretability research, while emphasizing that these methods improve alignment but do not solve it completely.
Practical Safety for Builders
Apply AI safety thinking to real systems. This module covers prompt injection, adversarial inputs, red-teaming, deployment guardrails, monitoring, governance, and responsible release practices for tool-using and multi-agent systems.
Common misconceptions
AI alignment is only a concern for superintelligent AI
Safety filters solve the alignment problem
RLHF makes models fully aligned with human values
AI safety and AI capability are always in tension
Ask your AI guide
Ask anything about AI Safety & Alignment, or choose a suggested question below.
AI responses are educational and may not be perfectly accurate. Press Enter to send, Shift+Enter for new line.
Related courses
Agent Architectures
Survey the major architectural patterns for building AI agents. From simple ReAct loops to structured planning systems, learn how different architectures trade off capability, reliability, and interpretability.
Multi-Agent Systems
Explore the design and behavior of systems with multiple collaborating AI agents. Learn how agents communicate, coordinate, divide labor, and resolve conflicts — and how emergent behaviors arise when many agents interact.
Agentic AI in the Real World
Survey how agentic AI is being deployed across industries today. From software engineering and scientific research to healthcare and finance, examine real-world use cases, the lessons learned, and the challenges that remain unsolved.