Abstract:
There has been a recent surge of interest in non-parametric bandit algorithms
based on subsampling. One drawback, however, of these approaches is the
additional computational complexity induced by random subsampling and the
need to store the full history of rewards. Our first contribution is to show
that a simple deterministic subsampling rule, proposed in the recent work of
\citet{baudry2020sub} under the name of “last-block subsampling”, is
asymptotically optimal in one-parameter exponential families. In
addition, we prove that these guarantees also hold when limiting
the algorithm's memory to a polylogarithmic function of the
time horizon. These findings open up new perspectives, in
particular for non-stationary scenarios in which the arm
distributions evolve over time. We propose a variant of the
algorithm in which only the most recent observations are
used for subsampling, achieving optimal regret guarantees
under the assumption of a known number of abrupt changes. Extensive
numerical simulations highlight the merits of this approach, particularly
when the changes affect more than just the means of the rewards.