Abstract:
We present GradientDICE for estimating the density ratio between the state distribution of the target policy and the sampling distribution in off-policy reinforcement learning.
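Concretely, in one common notation, with $d_\pi$ the state distribution of the target policy $\pi$ and $\mu$ the sampling distribution, the estimand is
\[
  \tau_*(s) \;=\; \frac{d_\pi(s)}{\mu(s)}.
\]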
GradientDICE fixes several problems of GenDICE (Zhang et al., 2020), the current state-of-the-art for estimating such density ratios.
Namely, the optimization problem in GenDICE is no longer a convex-concave saddle-point problem once nonlinearity is introduced into the parameterization of the optimization variables to ensure positivity,
so primal-dual algorithms are not guaranteed to find the desired solution.
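For example, with a parameterization such as $\tau_\theta(s) = \exp\!\big(\theta^\top \phi(s)\big)$ (an illustrative choice for enforcing $\tau_\theta > 0$, where $\phi$ denotes assumed state features; this is not necessarily the parameterization used in GenDICE), an objective that is convex in $\tau$ is in general no longer convex in $\theta$.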
However, such nonlinearity is essential to ensure the consistency of GenDICE even with a tabular representation.
This is a fundamental contradiction arising from GenDICE's original formulation of the optimization problem.
In GradientDICE, we optimize a different objective from GenDICE by invoking the Perron-Frobenius theorem and eliminating GenDICE's use of divergence;
consequently, nonlinearity in parameterization is not necessary for GradientDICE,
which is provably convergent under linear function approximation.