Lessons from aligning robots to human preferences
Our current methods for making large pretrained models helpful, truthful, aligned with what people want, and free of negative side-effects on users and society are falling short. The reward models they learn fail to capture these values, in part because of confounding variables that are easier to learn, and in part because of misinterpreting human feedback. In this talk, I want to dive into the lessons we've learned on the robotics and sequential decision-making side about learning good reward models, and how these lessons might transfer to aligning large pretrained models.
Bio: Anca Dragan is an associate professor in the EECS Department at UC Berkeley. Her goal is to enable AI agents to work for and around people. She runs the InterACT laboratory, where she focuses on algorithmic human-robot interaction: algorithms that move beyond the robot's function in isolation and generate robot behavior that coordinates well with human actions and is aligned with what humans actually want the robot to do. Anca received her Ph.D. from Carnegie Mellon University's Robotics Institute. She helped found the Berkeley AI Research Laboratory and is co-principal investigator of the Center for Human-Compatible AI. She has been honored with the Presidential Early Career Award for Scientists and Engineers (PECASE), the NSF CAREER award, Sloan and Okawa awards, the ONR Young Investigator Award, MIT TR35, and the IEEE RAS Early Academic Career Award.