Abstract:
Large-scale Markov decision processes (MDPs) require planning algorithms with
runtime independent of the number of states of the MDP. We consider the planning
problem in MDPs using linear value function approximation with only weak
requirements: low approximation error for the optimal value function, and a small
set of “core” states whose features span those of other states. In particular, we
make no assumptions about the representability of policies or value functions of
non-optimal policies. Our algorithm produces almost-optimal actions for any state
using a generative oracle (simulator) for the MDP, while its computation time scales
polynomially with the number of features, core states, and actions, and with the effective
horizon.
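
As a brief sketch of the core-state requirement (the symbols $\phi$, $\mathcal{C}$, and $\lambda$ below are illustrative notation, not fixed by the abstract): the assumption that the features of a small set of core states span those of all other states can be written as
\[
  \phi(s) \in \operatorname{span}\{\phi(c) : c \in \mathcal{C}\}
  \quad \text{for every state } s,
  \qquad \text{i.e.} \quad
  \phi(s) = \sum_{c \in \mathcal{C}} \lambda_c(s)\,\phi(c)
  \ \text{for some coefficients } \lambda_c(s) \in \mathbb{R},
\]
where $\phi$ denotes the feature map and $\mathcal{C}$ the (small) set of core states.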