Building Embodied Autonomous Agents with Multimodal Interaction
In this talk I will show how we can design modular agents for visual navigation that can perform tasks specified by natural language instructions, explore efficiently and plan over long horizons, and build and utilize 3D semantic maps, all while generalizing across domains and tasks. Specifically, I will first introduce a novel framework that builds and utilizes 3D semantic maps to learn both action and perception in a completely self-supervised manner. I will show that the new framework can be used to close the action-perception loop: it improves the object detection and instance segmentation performance of a pretrained perception model by moving around in training environments, while the improved perception model can in turn be used to improve performance on object goal navigation tasks. In the second part of the talk, I will introduce a method to ground pretrained text-only language models to the visual domain, enabling them to process arbitrarily interleaved image-and-text inputs and generate free-form text interleaved with retrieved images. I will show that the model achieves strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase its interactive abilities.
Bio: Russ Salakhutdinov is a UPMC Professor of Computer Science in the Department of Machine Learning at CMU. He received his PhD in computer science from the University of Toronto. After spending two post-doctoral years at MIT, he joined the University of Toronto and later moved to CMU. Russ's primary interests lie in deep learning, machine learning, and large-scale optimization. He is an action editor of the Journal of Machine Learning Research, served as a director of AI research at Apple, served on the senior program committee of several top-tier learning conferences including NeurIPS and ICML, was a program co-chair for ICML 2019, and will serve as a general chair for ICML 2024. He is an Alfred P. Sloan Research Fellow, a Microsoft Research Faculty Fellow, and a recipient of the Early Researcher Award, Google Faculty Award, and Nvidia's Pioneers of AI award.