Taming the Sim-to-Real Gap in Reinforcement Learning
Reinforcement learning (RL) has garnered significant interest in recent years due to its success in a wide variety of modern applications. However, ensuring robustness and safety in a sample-efficient manner in the presence of the inevitable sim-to-real gap -- where the deployed environment differs from the training one -- remains a critical challenge. In this talk, we adopt the framework of distributionally robust Markov decision processes (RMDPs), which aims to learn a policy that optimizes the worst-case performance when the deployed environment falls within a prescribed uncertainty set around the nominal MDP. We uncover that RMDPs are not necessarily easier or harder to learn than standard MDPs: the statistical consequences of the robustness requirement depend heavily on the size and shape of the uncertainty set, in somewhat surprising ways. We further consider the offline setting and incorporate the principle of pessimism to counter the insufficient coverage and sample scarcity of historical datasets.
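For reference, a standard way to write the robust objective is sketched below; the notation (an uncertainty set \mathcal{U}^{\sigma}(P^0) of radius \sigma around the nominal transition kernel P^0, discount factor \gamma, reward r) is generic and assumed for illustration rather than taken from the papers themselves:

V^{\pi,\sigma}(s) \;=\; \inf_{P \in \mathcal{U}^{\sigma}(P^0)} \mathbb{E}_{\pi, P}\!\Big[ \textstyle\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s \Big],
\qquad
\widehat{\pi} \;\in\; \arg\max_{\pi} V^{\pi,\sigma}(s),

so that the learned policy \widehat{\pi} maximizes the worst-case discounted return over all transition kernels in the uncertainty set, and the standard MDP objective is recovered when \sigma = 0.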
This talk is based on joint work with Sebastian Jaimungal (UToronto), Cyril Benezet (ENSIIE), and Nick Martin. arXiv links: arxiv.org/abs/2203.09612, arxiv.org/abs/2302.14109