Abstract:
We propose a new family of bandit algorithms, that are formulated in a general way based on the Biased Maximum Likelihood Estimation (BMLE) method originally appearing in the adaptive control literature. We design the reward-bias term to tackle the exploration and exploitation tradeoff for stochastic bandit problems. We provide a general recipe for the BMLE algorithm and derive a simple explicit closed-form expression for the index of an arm for exponential family reward distributions. We prove that the derived BMLE indices achieve a logarithmic finite-time regret bound and hence attain order-optimality, for both exponential families and the cases beyond parametric distributions. Through extensive simulations, we demonstrate that the proposed algorithms achieve regret performance comparable to the best of several state-of-the-art baseline methods, while being computationally efficient in comparison to other best-performing methods. The generality of the proposed approach makes it possible to address more complex models, including general adaptive control of Markovian systems.