
Alignment Techniques and Research
AGAI 302 · Module 2
Survey the major technical approaches used to make AI systems more helpful, honest, and safe. This module covers RLHF, Constitutional AI, debate, scalable oversight, and interpretability research, and emphasizes that these methods improve alignment but do not fully solve it.
Lessons in this module
RLHF and Preference Training
Learn how reinforcement learning from human feedback (RLHF) trains models to better follow human preferences, and why preference training has important limitations.
Constitutional AI, Debate, and Scalable Oversight
Study methods that try to improve supervision by using written principles, model critique, debate, and AI-assisted review.
Interpretability and Mechanistic Understanding
Learn what interpretability research tries to uncover inside neural networks and why understanding model internals matters for safety.