Combinatorial Pure Exploration of Multi-armed Bandits with Full-bandit Feedback
We study the problem of stochastic combinatorial pure exploration, where an agent sequentially pulls a set of single arms (a.k.a. a super-arm) from given n arms and tries to identify the best super-arm. Most existing work has considered the semi-bandit setting, where the agent can observe the reward of each pulled arm, or has assumed that each individual arm can be queried at each round. However, in real-world applications it is often costly, or even impossible, to observe the rewards of individual arms. In this study, we tackle the full-bandit setting, where only a noisy observation of the total sum of rewards of a super-arm is given at each pull. Although our problem can be regarded as an instance of best arm identification in linear bandits, a naive approach based on linear bandits is computationally infeasible since the number of super-arms K is exponential in n. To cope with this problem, we first design a polynomial-time approximation algorithm for a 0-1 quadratic programming problem arising in confidence ellipsoid maximization. Based on this approximation algorithm, we propose efficient bandit algorithms for the top-k selection problem and provide a sample complexity upper bound that remains worst-case optimal. We also design a simple algorithm for general combinatorial constraints (e.g., paths, matchings, or matroids) and provide another sample complexity upper bound for it. Finally, we conduct experiments on large-scale datasets with more than 10^10 super-arms, demonstrating the superiority of our algorithms in terms of both computation time and sample complexity.
This is joint work with Liyuan Xu, Atsushi Miyauchi, Junya Honda, and Masashi Sugiyama (arXiv: https://arxiv.org/abs/1902.10582).
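To make the difference between the two feedback models concrete, here is a minimal sketch of a simulator contrasting semi-bandit feedback (a noisy reward for every pulled arm) with the full-bandit feedback studied in the paper (a single noisy observation of the super-arm's total reward). The arm means, noise level, and the specific super-arm are hypothetical values chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance for illustration: n single arms with unknown means.
n = 6
means = rng.uniform(0.0, 1.0, size=n)

def pull_semi_bandit(super_arm):
    """Semi-bandit feedback: a noisy reward for EACH arm in the super-arm."""
    return means[super_arm] + rng.normal(0.0, 0.1, size=len(super_arm))

def pull_full_bandit(super_arm):
    """Full-bandit feedback: only a noisy observation of the TOTAL sum
    of rewards of the pulled super-arm is revealed."""
    return means[super_arm].sum() + rng.normal(0.0, 0.1)

# A size-3 super-arm, as in the top-k selection problem with k = 3.
super_arm = np.array([0, 2, 5])
semi_obs = pull_semi_bandit(super_arm)   # one sample per pulled arm
full_obs = pull_full_bandit(super_arm)   # a single scalar observation
```

Under full-bandit feedback the learner never sees per-arm rewards, which is why the problem reduces to best arm identification in a linear bandit: each super-arm corresponds to a 0-1 indicator vector, and the observation is a noisy inner product with the mean vector.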