13/04/2021

Q-learning with logarithmic regret

Kunhe Yang, Lin Yang, Simon Du

Keywords:

Abstract: This paper presents the first non-asymptotic result showing a model-free algorithm can achieve logarithmic cumulative regret for episodic tabular reinforcement learning if there exists a strictly positive sub-optimality gap. We prove that the optimistic Q-learning studied in [Jin et al. 2018] enjoys a {\mathcal{O}}\!\left(\frac{SA\cdot \mathrm{poly}\left(H\right)}{\Delta_{\min}}\log\left(SAT\right)\right) cumulative regret bound where S is the number of states, A is the number of actions, H is the planning horizon, T is the total number of steps, and \Delta_{\min} is the minimum sub-optimality gap of the optimal Q-function. This bound matches the information theoretical lower bound in terms of S,A,T up to a \log\left(SA\right) factor. We further extend our analysis to the discounted setting and obtain a similar logarithmic cumulative regret bound.

 0
 0
 0
 0
This is an embedded video. Talk and the respective paper are published at AISTATS 2021 virtual conference. If you are one of the authors of the paper and want to manage your upload, see the question "My papertalk has been externally embedded..." in the FAQ section.

Comments

Post Comment
no comments yet
code of conduct: tbd

Similar Papers