Two Recent Results on Regret Minimization and Best Arm Identification in Multi-Armed Bandits
The multi-armed bandit (MAB) is a simple yet powerful model of sequential decision-making under an exploration-exploitation tradeoff. In this talk, I will present two recent results on regret minimization and best arm identification (a.k.a. pure exploration).
First, we will consider a variant of the classic multi-armed bandit problem where the expected reward of each arm is a function of an unknown parameter. We propose a new bandit model, called the regional bandit, that naturally bridges the classical non-informative bandit setting, where the player learns only about the chosen arm, and the global bandit model, where sampling one arm reveals information about all arms. We propose an efficient algorithm, UCB-g, that solves the regional bandit problem by combining the Upper Confidence Bound (UCB) and greedy principles. We derive both parameter-dependent and parameter-free regret upper bounds, and we also establish a matching lower bound, which proves the order optimality of UCB-g.
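As background for the UCB principle that UCB-g builds on, here is a minimal sketch of the standard UCB1 index policy on Bernoulli arms. This is illustrative only: the abstract does not specify UCB-g's greedy component or how it exploits the shared parameter structure, so this sketch is a standard baseline, not the proposed algorithm.

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Minimal UCB1 simulation on Bernoulli arms.

    Illustrative baseline only: UCB-g additionally uses a greedy rule
    over shared parameters, whose details are not given in the abstract.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k     # number of pulls per arm
    sums = [0.0] * k     # cumulative reward per arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialize
        else:
            # index = empirical mean + confidence bonus
            arm = max(range(k),
                      key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = ucb1([0.3, 0.5, 0.7], horizon=5000)
```

Over a long horizon, the confidence bonus shrinks for well-sampled arms, so the pull counts concentrate on the arm with the highest mean (here, the third arm).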
In the second part of this talk, I will discuss best arm identification in multi-armed bandits when the problem setting can be either stochastic or adversarial and is not revealed to the forecaster a priori. We propose a universal best arm identification algorithm, called S3-BA, that returns the best arm without knowing the underlying model. The key idea is to simultaneously explore the arms and learn the problem setting: exploration begins under the assumption of a stochastic bandit and switches strategies once an adversarial setting is declared. We analyze the performance of the proposed algorithm for both stochastic and adversarial bandits. We also propose an adaptive version of S3-BA, called Ad-S3-BA, that facilitates parameter tuning via simultaneous arm identification and gap estimation.
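To make the stochastic side of best arm identification concrete, here is a sketch of successive elimination, a standard pure-exploration baseline. This is not S3-BA: the abstract does not describe S3-BA's sampling rule or its statistical test for declaring the adversarial setting, so the sketch only illustrates the kind of stochastic identification routine such an algorithm would start from.

```python
import math
import random

def successive_elimination(means, delta=0.05, seed=0):
    """Best arm identification on Bernoulli arms by successive elimination.

    Standard baseline only: S3-BA's actual rules, including its test for
    switching to the adversarial setting, are not given in the abstract.
    """
    rng = random.Random(seed)
    k = len(means)
    active = list(range(k))
    counts = [0] * k
    sums = [0.0] * k
    t = 0
    while len(active) > 1:
        t += 1
        for i in active:  # sample every surviving arm once per round
            sums[i] += 1.0 if rng.random() < means[i] else 0.0
            counts[i] += 1
        # confidence radius shrinking with the round index t
        rad = math.sqrt(math.log(4 * k * t * t / delta) / (2 * t))
        leader = max(sums[i] / counts[i] for i in active)
        # drop arms whose optimistic estimate falls below the leader
        active = [i for i in active
                  if sums[i] / counts[i] + 2 * rad >= leader]
    return active[0]

best_arm = successive_elimination([0.2, 0.45, 0.8])
```

With probability at least 1 - delta, the returned index is the arm with the highest mean; the number of samples spent on a suboptimal arm scales with the inverse squared gap to the best arm, which is why gap estimation (as in Ad-S3-BA) is useful for tuning.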