Abstract:
Value function learning remains a critical component of many reinforcement learning systems. Many algorithms are based on temporal difference (TD) updates, which have well-documented divergence issues, even though sound alternatives such as Gradient TD exist. Unsound approaches such as Q-learning and TD remain popular because divergence seems rare in practice and these algorithms typically perform well. However, recent work with large neural network learning systems reveals that instability is more common than previously thought. Practitioners face a difficult dilemma: choose an easy-to-use and performant TD method, or a more complex algorithm that is more sound but harder to tune, less sample efficient, and underexplored in control settings. In this paper, we introduce a new method called TD with Regularized Corrections (TDRC), which attempts to balance ease of use, soundness, and performance. It behaves as well as TD when TD performs well, but is sound even in cases where TD diverges. We characterize the expected update of TDRC and show that it inherits the soundness guarantees of Gradient TD while converging to the same solution as TD. Empirically, TDRC exhibits strong performance and low parameter sensitivity across several problems.
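
The abstract does not state the update equations. As an illustrative sketch only, the snippet below shows one plausible reading, consistent with the name "Regularized Corrections" and the stated connection to Gradient TD: a TDC-style linear update in which the secondary (correction) weights receive an L2 penalty. All names here (w, h, x, alpha_h, beta) are hypothetical illustration choices, and beta is an assumed regularization strength, not a value taken from the paper.

```python
import numpy as np

def tdrc_style_update(w, h, x, x_next, reward, gamma, alpha, alpha_h, beta):
    """One hypothetical TDRC-style update for linear value estimation.

    w       -- primary weights; the value estimate is v(s) = w . x(s)
    h       -- secondary "correction" weights
    x, x_next -- feature vectors for the current and next state
    beta    -- assumed L2 regularization strength on h
    """
    # TD error under the current primary weights
    delta = reward + gamma * np.dot(w, x_next) - np.dot(w, x)
    # Primary weights: a TD step plus a Gradient-TD-style correction term
    w_new = w + alpha * (delta * x - gamma * np.dot(h, x) * x_next)
    # Correction weights: track the expected TD error, with an L2 penalty
    # on h (the "regularized corrections" in this reading)
    h_new = h + alpha_h * ((delta - np.dot(h, x)) * x - beta * h)
    return w_new, h_new
```

With beta = 0 this sketch reduces to a standard TDC update, while a positive beta shrinks the correction weights toward zero, which is one way the regularization described in the abstract could keep the method close to TD when TD is well behaved.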