Data-adaptive model selection in online learning
We consider sequential learning under both full-information and contextual bandit (limited-information) feedback. Such models are commonly used in platforms for online advertisement placement, repeated auctions, wireless channel allocation, and mobile healthcare. Theoretical guarantees are traditionally stated as (worst-case) regret with respect to a fixed offline benchmark. However, to maximize overall reward, the choice of offline benchmark matters as much as the choice of regret-minimizing algorithm.
In this talk, I argue that the choice of offline benchmark should be adapted online, according to the data itself, and present computationally efficient approaches for doing so in contextual prediction with both full-information and bandit feedback. We obtain guarantees on regret/reward that are almost as good as those of an "oracle" algorithm that knows the right generative model for the data, including whether the environment is actually stochastic or not. We achieve this by adapting classical statistical techniques -- structural risk minimization, validation and hypothesis testing -- to the sequential setting with potentially adversarial data and adaptive sampling.
Joint work with Mitas Ray, Niladri Chatterji, Peter Bartlett, and Anant Sahai.