13/04/2021

Sample complexity bounds for two timescale value-based reinforcement learning algorithms

Tengyu Xu, Yingbin Liang

Keywords:

Abstract: Two timescale stochastic approximation (SA) has been widely used in value-based reinforcement learning algorithms. In the policy evaluation setting, it can model the linear and nonlinear temporal difference learning with gradient correction (TDC) algorithms as linear SA and nonlinear SA, respectively. In the policy optimization setting, two timescale nonlinear SA can also model the greedy gradient-Q (Greedy-GQ) algorithm. Previous studies analyzed the non-asymptotic convergence of linear TDC and Greedy-GQ in the Markovian setting with a single-sample update at each iteration, while for the nonlinear TDC algorithm only asymptotic convergence has been established. In this paper, we study the non-asymptotic convergence rate of two timescale linear and nonlinear TDC and of Greedy-GQ under Markovian sampling and with mini-batch data for each update. For linear TDC, we provide a novel non-asymptotic analysis that achieves a sample complexity of $\mathcal{O}(\epsilon^{-1}\log(1/\epsilon))$. For nonlinear TDC and Greedy-GQ, we show that both algorithms attain an $\epsilon$-accurate stationary solution with sample complexity $\mathcal{O}(\epsilon^{-2})$. This is the first non-asymptotic convergence result for nonlinear TDC, and our result for Greedy-GQ improves on the previous result order-wise by a factor of $\mathcal{O}(\epsilon^{-1}\log(1/\epsilon))$.
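To make the two timescale structure concrete, below is a minimal sketch of a mini-batch linear TDC update of the kind the abstract refers to. It is not the authors' code; the feature map phi, the step sizes alpha (slow, for theta) and beta (fast, for w), and the batch format are illustrative assumptions.

```python
# Hypothetical mini-batch linear TDC sketch (illustration only, not the paper's implementation).
# theta: value-function parameters (slow timescale); w: gradient-correction term (fast timescale).
import numpy as np

def tdc_minibatch_update(theta, w, batch, phi, gamma, alpha, beta):
    """One mini-batch TDC update.

    batch : list of (s, r, s_next) transitions sampled along the Markov chain
    phi   : feature map, phi(s) -> np.ndarray of shape (d,)
    gamma : discount factor
    alpha : slow (theta) step size; beta : fast (w) step size
    """
    g_theta = np.zeros_like(theta)
    g_w = np.zeros_like(w)
    for s, r, s_next in batch:
        f, f_next = phi(s), phi(s_next)
        delta = r + gamma * theta @ f_next - theta @ f   # TD error
        g_theta += delta * f - gamma * (f @ w) * f_next  # TD direction with gradient correction
        g_w += (delta - f @ w) * f                       # fast-timescale tracking update
    m = len(batch)
    theta_new = theta + alpha * g_theta / m
    w_new = w + beta * g_w / m
    return theta_new, w_new
```

Averaging the update directions over a mini-batch (rather than using a single sample per iteration) is what distinguishes the setting analyzed here from the single-sample Markovian analyses mentioned above.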

The talk and the paper were published at the AISTATS 2021 virtual conference.

