Abstract:
We make three contributions toward better understanding policy gradient methods.
First, we show that with the true gradient, policy gradient with a softmax parametrization converges at an $O(1/t)$ rate, with constants depending on the problem and initialization.
This result significantly improves over recent asymptotic convergence results.
The analysis relies on two findings:
that the softmax policy gradient satisfies a \L{}ojasiewicz inequality, and that the minimum probability of an optimal action during optimization can be bounded in terms of its initial value.
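As a minimal illustration of the kind of inequality meant, consider the single-state (bandit) case with a softmax policy $\pi_\theta = \mathrm{softmax}(\theta)$ over $K$ actions, a reward vector $r \in \mathbb{R}^K$, and an optimal action $a^*$; this notation is assumed here for exposition, and the display is a sketch rather than the paper's exact statement. The exact gradient and a \L{}ojasiewicz-type lower bound then take the form
\[
  \frac{\partial\, \pi_\theta^\top r}{\partial \theta(a)}
  = \pi_\theta(a)\,\bigl(r(a) - \pi_\theta^\top r\bigr),
  \qquad
  \Bigl\|\frac{\partial\, \pi_\theta^\top r}{\partial \theta}\Bigr\|_2
  \;\ge\; \pi_\theta(a^*)\,\bigl(r(a^*) - \pi_\theta^\top r\bigr),
\]
so the gradient norm can only vanish when either the suboptimality gap $r(a^*) - \pi_\theta^\top r$ is small or the probability $\pi_\theta(a^*)$ of the optimal action is small; the second finding above rules out the latter quantity collapsing during optimization.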
Second, we analyze entropy-regularized policy gradient and show that in the one-state (bandit) case it enjoys a linear convergence rate $O(e^{-t})$,
while for general MDPs it converges at an $O(1/t)$ rate.
This result resolves an open question in the recent literature.
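For intuition in the same single-state sketch (again with assumed notation: temperature $\tau > 0$, softmax policy $\pi_\theta$, reward vector $r$), the entropy-regularized objective and its exact gradient read
\[
  \tilde{f}(\theta) = \pi_\theta^\top \bigl(r - \tau \log \pi_\theta\bigr),
  \qquad
  \frac{\partial \tilde{f}(\theta)}{\partial \theta(a)}
  = \pi_\theta(a)\,\Bigl(r(a) - \tau \log \pi_\theta(a) - \tilde{f}(\theta)\Bigr).
\]
The extra $-\tau \log \pi_\theta(a)$ term keeps the policy away from the deterministic corners of the simplex, where the unregularized gradient vanishes, which is one way to see why the regularized update can behave like a contraction, as described next.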
A key insight is that the entropy-regularized gradient update behaves similarly to the contraction operator in value learning,
with a contraction factor that depends on the current policy.
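For reference, the comparison is to the Bellman optimality operator of value learning, which is a $\gamma$-contraction in the sup norm (a textbook fact stated here in standard MDP notation, not a result of this paper):
\[
  (\mathcal{T}V)(s) = \max_a \Bigl\{ r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Bigr\},
  \qquad
  \|\mathcal{T}V_1 - \mathcal{T}V_2\|_\infty \le \gamma\, \|V_1 - V_2\|_\infty .
\]
The point is that the entropy-regularized update admits a similar contraction-style bound, except that the effective contraction factor depends on the current policy rather than being the fixed constant $\gamma$.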
Finally, combining the above two results with additional lower bounds, we explain, in terms of convergence rates, how entropy regularization improves policy optimization even when the true gradient is available.
These results provide a theoretical understanding of the impact of entropy and corroborate existing empirical studies.