Abstract:
Offline Reinforcement Learning (RL) is a promising approach for learning optimal
policies in environments where direct exploration is expensive or infeasible. However, adopting such policies in practice is often challenging: they are hard
to interpret within the application context and lack measures of uncertainty for the
learned policy value and its decisions. To overcome these issues, we propose an
Expert-Supervised RL (ESRL) framework that uses uncertainty quantification
for offline policy learning. In particular, we make three contributions: 1) the method
can learn safe and optimal policies through hypothesis testing, 2) ESRL allows for
implementations at different levels of risk aversion, tailored to the application context,
and finally, 3) we propose a way to interpret ESRL’s policy at every state through
posterior distributions, and we use this framework to compute off-policy value function
posteriors. We provide theoretical guarantees for our estimators and regret bounds
consistent with Posterior Sampling for RL (PSRL). The sample efficiency of ESRL
is independent of the chosen risk-aversion threshold and of the quality of the behavior
policy.