Abstract:
In this paper we present a unified framework based on primal-dual stochastic mirror descent for approximately solving infinite-horizon Markov decision processes (MDPs) given a generative model. When applied to an average-reward MDP with \A total actions and mixing time bound \tmix, our method computes an \eps-optimal policy using an expected \Otil(\tmix^2 \A \eps^{-2}) samples from the state-transition matrix, removing the ergodicity dependence of prior art. When applied to a \gamma-discounted MDP with \A total actions, our method computes an \eps-optimal policy using an expected \Otil((1-\gamma)^{-4} \A \eps^{-2}) samples, improving over the best-known primal-dual methods and matching the state of the art up to a (1-\gamma)^{-1} factor. Both methods are model-free, update state values and policy simultaneously, and run in time linear in the number of samples taken.
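To make the "model-free, simultaneous value and policy updates from a generative model" claim concrete, the following is a minimal, illustrative sketch of a generic stochastic primal-dual loop on a toy discounted MDP: values are updated by a Euclidean (gradient) step and the policy by an entropic mirror-ascent (multiplicative-weights) step on the sampled Bellman residual. The toy MDP, step sizes, and specific updates are assumptions for illustration only, not the algorithm or analysis of this paper.

\begin{verbatim}
import numpy as np

# Toy 2-state, 2-action discounted MDP (assumed for illustration).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 2, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = next-state dist.
r = rng.uniform(size=(n_states, n_actions))                        # rewards in [0, 1]

def sample_next_state(s, a):
    """Generative model: draw one sample from the state-transition matrix."""
    return rng.choice(n_states, p=P[s, a])

v = np.zeros(n_states)                    # primal variables: value estimates
logits = np.zeros((n_states, n_actions))  # dual variables: per-state policy logits
eta_v, eta_pi = 0.05, 0.5                 # step sizes (assumed, untuned)

for t in range(20000):
    # Softmax policy induced by the current dual iterate.
    policy = np.exp(logits - logits.max(axis=1, keepdims=True))
    policy /= policy.sum(axis=1, keepdims=True)

    s = rng.integers(n_states)
    a = rng.choice(n_actions, p=policy[s])
    s_next = sample_next_state(s, a)

    # Stochastic Bellman residual at (s, a) from a single generative-model sample.
    td = r[s, a] + gamma * v[s_next] - v[s]

    # Simultaneous updates: Euclidean step on values, entropic (multiplicative-
    # weights) step on the policy; total work is linear in the samples drawn.
    v[s] += eta_v * td
    logits[s, a] += eta_pi * td

print("estimated values:", np.round(v, 3))
print("greedy policy:", policy.argmax(axis=1))
\end{verbatim}

Each iteration touches only the sampled state-action pair, so the running time is proportional to the number of generative-model samples, mirroring (in spirit) the runtime claim above.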