Abstract:
Bayesian reward learning from demonstrations enables rigorous safety and uncertainty analysis when performing imitation learning. However, Bayesian reward learning methods are typically computationally intractable for complex control problems. We propose a highly efficient Bayesian reward learning algorithm that scales to high-dimensional imitation learning problems by first pre-training a low-dimensional feature encoding via self-supervised tasks and then leveraging preferences over demonstrations to perform fast Bayesian inference via sampling. We evaluate our approach on the task of learning to play Atari games from demonstrations, without access to the game score, and achieve state-of-the-art imitation learning performance. We further demonstrate that our approach enables efficient computation of high-confidence performance bounds for any evaluation policy. We show that these high-confidence performance bounds can be used to accurately rank the performance and risk of a variety of evaluation policies, despite not having samples of the true reward function.
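
The following is a minimal sketch of the two-stage idea described above, not the paper's exact implementation: it assumes a frozen, pre-trained feature encoder whose per-state embeddings have already been computed, a linear reward in those features, and a Bradley-Terry-style preference likelihood sampled with random-walk Metropolis-Hastings. The feature dimension, step size, and demonstration data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

FEATURE_DIM = 16  # assumed size of the pre-trained, self-supervised embedding

def demo_features(demo):
    """Sum per-state feature vectors over a demonstration.

    `demo` is assumed to be an array of shape (T, FEATURE_DIM), i.e. the
    output of a frozen self-supervised encoder applied to each state.
    """
    return demo.sum(axis=0)

def log_likelihood(w, pref_pairs, beta=1.0):
    """Bradley-Terry log-likelihood of preferences under linear reward weights w."""
    ll = 0.0
    for phi_i, phi_j in pref_pairs:  # demonstration j is preferred over i
        r_i, r_j = beta * w @ phi_i, beta * w @ phi_j
        ll += r_j - np.logaddexp(r_i, r_j)  # log P(j preferred over i)
    return ll

def mcmc_posterior(pref_pairs, n_samples=2000, step=0.05):
    """Random-walk Metropolis-Hastings over unit-norm reward weights."""
    w = rng.normal(size=FEATURE_DIM)
    w /= np.linalg.norm(w)
    ll = log_likelihood(w, pref_pairs)
    samples = []
    for _ in range(n_samples):
        proposal = w + step * rng.normal(size=FEATURE_DIM)
        proposal /= np.linalg.norm(proposal)  # keep weights on the unit sphere
        ll_prop = log_likelihood(proposal, pref_pairs)
        if np.log(rng.uniform()) < ll_prop - ll:  # accept/reject
            w, ll = proposal, ll_prop
        samples.append(w.copy())
    return np.array(samples)

# Toy usage: two synthetic demonstrations with a known preference ordering.
demo_a = rng.normal(size=(50, FEATURE_DIM))   # less-preferred demonstration
demo_b = demo_a + 0.5                         # preferred demonstration
pairs = [(demo_features(demo_a), demo_features(demo_b))]
posterior = mcmc_posterior(pairs)

# The posterior over w can then bound the return of any evaluation policy,
# e.g. by taking a lower quantile of (w @ policy_feature_counts) over samples.
policy_phi = demo_features(demo_b)
returns = posterior @ policy_phi
print("95%-confidence lower bound on return:", np.quantile(returns, 0.05))
```

The last two lines illustrate how posterior samples over reward weights yield high-confidence performance bounds for an evaluation policy: each sample scores the policy's feature counts, and a lower quantile of the resulting returns serves as the bound used for ranking policies by performance and risk.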