Abstract:
We propose a reward function estimation framework for inverse reinforcement learning with deep energy-based policies. Our method sequentially estimates the policy, the $Q$-function, and the reward. We refer to it as the PQR method. This method does not require the assumption that the reward depends on the state only, but instead allows also for dependency on the choice of action. Moreover, the method allows for the state transitions to be stochastic. To accomplish this, we assume the existence of one anchor action whose reward is known, typically the action of doing nothing, yielding no reward. We present both estimators and algorithms for the PQR method. When the environment transition is known, we prove that the reward estimator of PQR uniquely recovers the true reward.
With unknown transitions, convergence analysis is presented for the PQR method.
Finally, we apply PQR to both synthetic and real-world datasets, demonstrating superior performance in terms of reward estimation compared to competing methods.