Abstract:
We consider bandit optimization of a smooth reward function, where the goal is to minimize cumulative regret. This problem has been studied for \alpha-Hölder continuous (including Lipschitz) functions with 0<\alpha\leq 1. Our main result generalizes the reward function to Hölder spaces with exponent \alpha>1, bridging the gap between Lipschitz bandits and infinitely differentiable models such as linear bandits. For Hölder continuous functions, approaches based on random sampling in bins of a discretized domain are sufficient for optimality. In contrast, we propose a class of two-layer algorithms that deploy misspecified linear/polynomial bandit algorithms in bins. We demonstrate that the proposed algorithm can exploit higher-order smoothness of the function by deriving a regret upper bound of \tilde{O}(T^{\frac{d+\alpha}{d+2\alpha}}) when \alpha>1, which matches the existing lower bound. We also study adaptation to unknown function smoothness over a continuous scale of Hölder spaces indexed by \alpha, using a bandit model selection approach applied to our proposed two-layer algorithms. We show that it achieves a regret rate that matches the existing lower bound for adaptation within the \alpha\leq 1 subset.
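
As a quick sanity check on the stated rate, the exponent can be evaluated at the two ends of the smoothness scale. This is a sketch that uses only the bound quoted above. The comparison points (the T^{(d+1)/(d+2)} Lipschitz rate and the \sqrt{T} rate of linear bandits, both up to logarithmic factors) are standard results assumed here, not claims taken from this abstract.

```latex
% Exponent of the regret bound as a function of the smoothness \alpha
\[
  \rho(\alpha) = \frac{d+\alpha}{d+2\alpha}, \qquad
  \rho'(\alpha) = \frac{(d+2\alpha) - 2(d+\alpha)}{(d+2\alpha)^2}
               = -\frac{d}{(d+2\alpha)^2} < 0 .
\]
% Endpoints of the scale
\[
  \rho(1) = \frac{d+1}{d+2}, \qquad
  \lim_{\alpha\to\infty} \rho(\alpha) = \frac{1}{2} .
\]
```

The exponent therefore decreases monotonically from the Lipschitz value at \alpha=1 toward 1/2 as \alpha grows, which is consistent with the claim of bridging Lipschitz and linear bandits.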