Conceptual image of a human overseeing an AI system with safety controls
Advanced

AI Safety & Alignment

AGAI 302

Examine the core challenges of building AI systems that are safe, reliable, and aligned with human values. From prompt injection to reward hacking to long-term existential risk, develop a rigorous framework for thinking about AI safety.

Why Safety Is a Technical Problem

AI safety is often treated as a philosophical concern. In practice, it is a deeply technical one. How do you specify what you want an AI system to do? How do you ensure it does not find unintended shortcuts? How do you maintain meaningful human oversight as systems become more capable?

The Alignment Problem

The alignment problem is simple to state and extremely hard to solve: how do you build an AI system that reliably does what you actually want, not just what you said? This course explores the many ways alignment can fail — from specification gaming to reward hacking to deceptive alignment — and surveys the research approaches aimed at solving these problems.

What You Will Learn

You will build a rigorous framework for thinking about AI safety — not as an abstract philosophical concern but as a set of concrete engineering and governance challenges. You will study RLHF, Constitutional AI, interpretability research, and practical red-teaming approaches. You will learn to identify prompt injection and adversarial inputs, think through instrumental convergence, and apply safety thinking to the design of real AI systems.

Who This Course Is For

This course is for engineers, researchers, and product leaders who want to build AI systems responsibly and understand the technical landscape of AI safety. It is appropriate for practitioners who have built real AI systems and want to understand how and why they can fail — and for anyone who wants to engage seriously with the technical dimensions of AI risk rather than relying on vague intuitions.

What you will learn

  • Define AI alignment and explain why it is technically difficult
  • Describe the main ways AI systems can fail to align with human values
  • Explain RLHF and its role in aligning language models
  • Identify prompt injection and other adversarial risks
  • Summarize key interpretability research approaches
  • Apply safety thinking to real AI system design

Major topics

What is AI alignment and why is it hard?Reward hacking and specification gamingPrompt injection and adversarial inputsRLHF and its limitationsConstitutional AI and rule-based approachesInterpretability and mechanistic understandingInstrumental convergence and AI goalsGovernance, policy, and responsible deployment

Why this course matters

As AI systems become more capable and more autonomous, the consequences of misalignment grow. Understanding alignment — not as an abstract concern but as a concrete engineering challenge — is essential for anyone building or deploying AI systems.

Course modules

Common misconceptions

  • AI alignment is only a concern for superintelligent AI

  • Safety filters solve the alignment problem

  • RLHF makes models fully aligned with human values

  • AI safety and AI capability are always in tension

Ask your AI guide

AI Chat· AI Safety & Alignment
🤖

Ask anything about AI Safety & Alignment, or choose a suggested question below.

AI responses are educational and may not be perfectly accurate. Press Enter to send, Shift+Enter for new line.

Related courses