Abstract:
Reinforcement learning (RL) algorithms combined with modern function approximators such as kernel functions and deep neural networks have achieved significant empirical successes in large-scale applications with massive state spaces.
From a theoretical perspective, however, RL with function approximation poses a fundamental challenge to the design of algorithms with provable computational and statistical efficiency: one must simultaneously handle the exploration-exploitation tradeoff inherent in RL and the bias-variance tradeoff inherent in statistical estimation.
To address this challenge, we focus on the episodic setting where the action-value functions are represented by a kernel function or an over-parameterized neural network, and we propose the first provable RL algorithm with both polynomial runtime and polynomial sample complexity in this setting, without additional assumptions on the data-generating model.
In particular, for both the kernel and neural settings, we prove that an optimistic modification of the least-squares value iteration algorithm incurs an $\tilde{\mathcal{O}}(\delta_{\cF} H^2 \sqrt{T})$ regret, where $\delta_{\cF}$ characterizes the intrinsic complexity of the function class $\cF$, $H$ is the length of each episode, and $T$ is the total number of episodes.
Our regret bounds are independent of the number of states, and therefore remain valid even when the number of states diverges, which exhibits the benefit of function approximation.
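For concreteness, a minimal sketch of the regret notion referenced above, assuming the standard episodic definition (here $\pi_t$ denotes the policy executed in episode $t$, $x_1^t$ the initial state of that episode, and $V_1^{\star}$, $V_1^{\pi_t}$ the optimal and realized value functions at step $1$; these symbols are notational assumptions introduced for illustration):
\[
  \mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \Bigl[ V_1^{\star}\bigl(x_1^t\bigr) - V_1^{\pi_t}\bigl(x_1^t\bigr) \Bigr],
\]
so that the bound $\tilde{\mathcal{O}}(\delta_{\cF} H^2 \sqrt{T})$ controls the cumulative suboptimality accrued over all $T$ episodes.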