09/07/2020

Root-n-Regret for Learning in Markov Decision Processes with Function Approximation and Low Bellman Rank

Kefan Dong, Jian Peng, Yining Wang, Yuan Zhou

Keywords: Reinforcement learning

Abstract: In this paper, we consider the problem of online learning of Markov decision processes (MDPs) with very large state spaces. Under the assumptions of realizable function approximation and low Bellman rank, we develop an online learning algorithm that learns the optimal value function while at the same time achieving very low cumulative regret during the learning process. Our learning algorithm, Adaptive Value-function Elimination (AVE), is inspired by the policy elimination algorithm proposed in Jiang et al. (2017), known as OLIVE. One of our key technical contributions in AVE is to formulate the elimination steps in OLIVE as contextual bandit problems. This technique enables us to apply the active elimination and expert weighting methods from Dudik et al. (2011), instead of the random action exploration scheme used in the original OLIVE algorithm, for more efficient exploration and better control of the regret incurred in each policy elimination step. To the best of our knowledge, this is the first root-n-regret result for stochastic MDPs with general value function approximation.
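To make the elimination idea concrete, below is a minimal Python sketch of the OLIVE-style loop that AVE refines. It is a schematic under stated assumptions, not the authors' implementation: the candidate class is assumed finite, `avg_bellman_error` and `predicted_value` stand in for estimators the algorithm would build from rollouts, and the contextual-bandit exploration that distinguishes AVE from OLIVE is only indicated in the comments.

```python
# Schematic sketch of OLIVE-style value-function elimination (Jiang et al., 2017),
# the procedure AVE builds on. All names here (eliminate, avg_bellman_error,
# predicted_value, epsilon) are illustrative assumptions. In AVE, the
# uniform-random action exploration normally used to estimate Bellman errors is
# replaced by a contextual-bandit scheme (active elimination and expert
# weighting in the style of Dudik et al., 2011); that substitution is what
# controls the regret of each elimination round.

from typing import Callable, List


def eliminate(
    value_class: List[object],                  # surviving candidate value functions f
    predicted_value: Callable[[object], float], # f -> its predicted value at the initial state
    avg_bellman_error: Callable[[object, object, int], float],
    # (f_roll_in, f_evaluated, level h) -> estimated average Bellman error of
    # f_evaluated at level h under the state distribution of f_roll_in's greedy policy
    horizon: int,
    epsilon: float,                             # elimination tolerance
    max_iters: int = 100,
) -> object:
    """Iteratively eliminate value functions with large average Bellman error.

    Returns a surviving, approximately Bellman-consistent optimistic candidate.
    """
    survivors = list(value_class)
    for _ in range(max_iters):
        # Optimism: act greedily with respect to the survivor that predicts
        # the highest value at the initial state.
        f_opt = max(survivors, key=predicted_value)

        # Estimate the average Bellman error of f_opt at every level under the
        # roll-in distribution of its own greedy policy.
        errors = [avg_bellman_error(f_opt, f_opt, h) for h in range(horizon)]
        worst_h = max(range(horizon), key=lambda h: errors[h])
        if errors[worst_h] <= epsilon:
            return f_opt  # f_opt is (approximately) Bellman-consistent

        # Eliminate every candidate whose estimated Bellman error at the worst
        # level exceeds the tolerance; the low-Bellman-rank assumption is what
        # keeps the number of such elimination rounds small.
        survivors = [
            f for f in survivors
            if abs(avg_bellman_error(f_opt, f, worst_h)) <= epsilon
        ]
        if not survivors:
            raise RuntimeError("All candidates eliminated; check realizability "
                               "or loosen epsilon.")
    return max(survivors, key=predicted_value)
```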

The talk and the corresponding paper were published at the COLT 2020 virtual conference.
