Online sparse reinforcement learning

Abstract: We investigate the hardness of online reinforcement learning in sparse linear Markov decision process (MDP), with a special focus on the high-dimensional regime where the ambient dimension is larger than the number of episodes. Our contribution is two-fold. First, we provide a lower bound showing that linear regret is generally unavoidable, even if there exists a policy that collects well-conditioned data. Second, we show that if the learner has oracle access to a policy that collects well-conditioned data, then a variant of Lasso fitted Q-iteration enjoys a regret of O(N^{2/3}) where N is the number of episodes.

Online sparse reinforcement learning

Botao Hao, Tor Lattimore, Csaba Szepesvari, Mengdi Wang

Comments

Similar Papers