Alignment Techniques and Research

AGAI 302 · Module 2

Survey the major technical approaches used to make AI systems more helpful, honest, and safe. This module covers RLHF, Constitutional AI, debate, scalable oversight, and interpretability research, while emphasizing that these methods improve alignment but do not solve it completely.

Lessons in this module

RLHF and Preference Training

Learn how reinforcement learning from human feedback trains models to better follow human preferences, and why preference training has important limitations.

Constitutional AI, Debate, and Scalable Oversight

Study methods that try to improve supervision by using written principles, model critique, debate, and AI-assisted review.

Interpretability and Mechanistic Understanding

Learn what interpretability research tries to uncover inside neural networks and why understanding model internals matters for safety.
