Abstract:
According to previous studies, one of the major impediments to accurate off-policy learning is overestimation bias. This paper investigates a novel way to alleviate overestimation bias in a continuous control setting. Our method, Truncated Quantile Critics (TQC), blends three ideas: distributional representation of a critic, truncation of the critics' predictions, and ensembling of multiple critics. We show that all components are key to the achieved performance. The distributional representation combined with truncation allows for arbitrarily fine-grained control over the overestimation, and ensembling further improves the results of our method. TQC significantly outperforms the current state of the art on all environments of the continuous control benchmark suite, demonstrating a 25% improvement on the most challenging Humanoid environment.
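To make the central truncation idea concrete, the following is a minimal sketch, assuming PyTorch, of how pooled quantile atoms from an ensemble of critics could be truncated before forming a target; the function and argument names (`truncated_mixture`, `next_atoms`, `drop_per_net`) are illustrative assumptions, not identifiers from the paper.

```python
import torch


def truncated_mixture(next_atoms: torch.Tensor, drop_per_net: int) -> torch.Tensor:
    """Pool atoms from an ensemble of distributional critics and truncate the largest ones.

    next_atoms:   tensor of shape (batch, n_nets, n_quantiles) holding the atoms
                  predicted by each critic for the next state-action pair.
    drop_per_net: number of the largest atoms to discard per critic; this is the
                  truncation knob that counters overestimation.
    Returns the pooled, truncated atoms of shape (batch, (n_quantiles - drop_per_net) * n_nets).
    """
    batch, n_nets, n_quantiles = next_atoms.shape
    # Pool the atoms of all critics into a single mixture per sample.
    pooled = next_atoms.reshape(batch, n_nets * n_quantiles)
    # Sort in ascending order so the most optimistic atoms sit at the end.
    sorted_atoms, _ = torch.sort(pooled, dim=-1)
    # Keep only the smallest atoms; the largest drop_per_net * n_nets are removed.
    n_kept = (n_quantiles - drop_per_net) * n_nets
    return sorted_atoms[:, :n_kept]
```

Varying `drop_per_net` between 0 and the full number of quantiles illustrates how truncation of a distributional ensemble gives granular control over the optimism of the resulting target.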