Aligning ML Systems with Human Intent
ML systems are "intent aligned" if their behavior matches the intended goals of the system designer, many of which are implicit and informal. Alignment is difficult to achieve due to misspecified reward functions (Goodhart's law), unexpected behaviors that emerge at scale, and feedback loops arising from multi-agent interactions. For example, language models trained to predict tokens might give untruthful answers when truth and likelihood diverge, and recommender systems might optimize short-term engagement at the expense of long-term well-being. I will discuss empirically observed alignment issues in large-scale systems, as well as several techniques for addressing them, based on (1) improving human feedback to reduce reward misspecification and (2) extracting latent knowledge from models' hidden states using unsupervised learning. I'll then briefly discuss several other ideas, such as imbuing models with common-sense morality and using large-scale evaluation to detect novel failures.
Bio: Jacob is an Assistant Professor of Statistics at UC Berkeley, where he works on trustworthy and human-aligned machine learning. He received his PhD from Stanford University under Percy Liang and has previously worked at OpenAI and Open Philanthropy.