12/07/2020

Identifying the Reward Function by Anchor Actions

Sinong Geng, Houssam Nassif, Carlos Manzanares, Max Reppen, Ronnie Sircar

Keywords: Reinforcement Learning - General

Abstract: We propose a reward-function estimation framework for inverse reinforcement learning with deep energy-based policies. Our method sequentially estimates the policy, the $Q$-function, and the reward; we refer to it as the PQR method. PQR does not assume that the reward depends on the state alone: it also allows the reward to depend on the chosen action, and it permits stochastic state transitions. To accomplish this, we assume the existence of one anchor action whose reward is known, typically the action of doing nothing, which yields no reward. We present estimators and algorithms for the PQR method. When the environment transition is known, we prove that the PQR reward estimator uniquely recovers the true reward. With unknown transitions, we provide a convergence analysis for the PQR method. Finally, we apply PQR to both synthetic and real-world datasets, demonstrating superior reward estimation compared to competing methods.
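To make the pipeline concrete, here is a minimal tabular sketch of the PQR idea, assuming a finite state and action space, a known transition tensor, a fixed temperature, and an already-estimated energy-based policy. It is an illustration under those assumptions, not the authors' deep implementation; the function and variable names (`pqr_reward`, `pi`, `P`, `anchor`) are hypothetical.

```python
import numpy as np

def pqr_reward(pi, P, anchor, gamma=0.9, alpha=1.0, iters=500):
    """Recover a reward table from an energy-based policy via an anchor action.

    pi:     (S, A) estimated policy, pi[s, a] = probability of action a in state s
    P:      (S, A, S) known transition tensor, P[s, a, s'] = transition probability
    anchor: index of the anchor action, whose true reward is fixed to 0
    """
    S, _, _ = P.shape
    log_pi = np.log(pi)

    # Q-step: the anchor's zero reward pins down the soft value function,
    #   V(s) = gamma * E[V(s') | s, anchor] - alpha * log pi(anchor | s),
    # a gamma-contraction solved here by fixed-point iteration.
    V = np.zeros(S)
    for _ in range(iters):
        V = gamma * P[:, anchor, :] @ V - alpha * log_pi[:, anchor]

    # Energy-based policy identity: Q(s, a) = V(s) + alpha * log pi(a | s).
    Q = V[:, None] + alpha * log_pi

    # R-step: invert the soft Bellman equation,
    #   r(s, a) = Q(s, a) - gamma * E[V(s') | s, a].
    return Q - gamma * (P @ V)
```

By construction, the recovered reward of the anchor action is identically zero, which is exactly the identification assumption; the deep PQR method described in the abstract replaces these tabular objects with neural estimators and covers the case of unknown transitions.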

Talk and paper published at the ICML 2020 virtual conference.
