Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning

Abstract: We consider the problem of off-policy evaluation for reinforcement learning, where the goal is to estimate the expected reward of a target policy $\pi$ using offline data collected by running a logging policy $\mu$. Standard importance-sampling based approaches for this problem suffer from a variance that scales exponentially with time horizon $H$, which motivates a splurge of recent interest in alternatives that break the 'Curse of Horizon' (Liu et al. 2018, Xie et al. 2019). In particular, it was shown that a marginalized importance sampling (MIS) approach can be used to achieve an estimation error of order $O(H^3/ n)$ in mean square error (MSE) under an episodic Markov Decision Process model with finite states and potentially infinite actions. The MSE bound however is still a factor of $H$ away from a Cramer-Rao lower bound of order $\Omega(H^2/n)$. In this paper, we prove that with a simple modification to the MIS estimator, we can asymptotically attain the Cramer-Rao lower bound, provided that the action space is finite. We also provide a general method for constructing MIS estimators with high-probability error bounds.

06/12/2021

Asymptotically Efficient Off-Policy Evaluation for Tabular Reinforcement Learning

Ming Yin, Yu-Xiang Wang

Comments

Similar Papers

Bayesian decision-making under misspecified priors with applications to meta-learning

Max Simchowitz, Christopher Tosh, Akshay Krishnamurthy and Daniel Hsu, Thodoris Lykouris, Miro Dudik, Robert E Schapire

Keywords Abstract Paper

meta learning, bandits

Stage-wise Conservative Linear Bandits

Ahmadreza Moradipari, Christos Thrampoulidis, Mahnoosh Alizadeh

Keywords Abstract Paper

Learning from eXtreme Bandit Feedback

Romain Lopez, Inderjit S. Dhillon, Michael I. Jordan

Keywords Abstract Paper

Provably efficient safe exploration via primal-dual policy optimization

Dongsheng Ding, Xiaohan Wei, Zhuoran Yang and Zhaoran Wang, Mihailo Jovanovic

Keywords Abstract Paper

Near-Optimal Offline Reinforcement Learning via Double Variance Reduction

Ming Yin, Yu Bai, Yu-Xiang Wang

Keywords Abstract Paper

theory, optimization, reinforcement learning and planning

Deep Bandits Show-Off: Simple and Efficient Exploration with Deep Networks

Rong Zhu, Mattia Rigotti

Keywords Abstract Paper

theory, deep learning, reinforcement learning and planning, bandits

Low-rank generalized linear bandit problems

Yangyi Lu, Amirhossein Meisami, Ambuj Tewari

Keywords Abstract Paper

UCB-based Algorithms for Multinomial Logistic Regression Bandits

Sanae Amani, Christos Thrampoulidis

Keywords Abstract Paper

Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning

Gen Li, Laixi Shi, Yuxin Chen and Yuantao Gu, Yuejie Chi

Keywords Abstract Paper

theory, reinforcement learning and planning

Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization

Wes Chung, Valentin Thomas, Marlos C. Machado, Nicolas Le Roux

Keywords Abstract Paper

Improved Optimistic Algorithms for Logistic Bandits

Louis Faury, Marc Abeille, Clément Calauzènes, Olivier Fercoq

Keywords Abstract Paper

Online Learning, Active Learning, and Bandits

Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition

Liyu Chen, Haipeng Luo, Chen-Yu Wei

Keywords Abstract Paper

Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation

Yaqi Duan, Zeyu Jia, Mengdi Wang

Keywords Abstract Paper

Variational Bayesian Optimistic Sampling

Brendan O'Donoghue, Tor Lattimore

Keywords Abstract Paper

optimization, reinforcement learning and planning, generative model, bandits, online learning

Robust Contextual Bandits via Bootstrapping

Qiao Tang, Hong Xie, Yunni Xia and Jia Lee, Qingsheng Zhu

Keywords Abstract Paper

Efficient Bandit Convex Optimization: Beyond Linear Losses

Arun Sai Suggala, Pradeep Ravikumar, Praneeth Netrapalli

Keywords Abstract Paper

Model-based Reinforcement Learning for Continuous Control with Posterior Sampling

Ying Fan, Yifei Ming

Keywords Abstract Paper

Hybrid Regret Bounds for Combinatorial Semi-Bandits and Adversarial Linear Bandits

Keywords Abstract Paper

Risk-Sensitive Reinforcement Learning: Near-Optimal Risk-Sample Tradeoff in Regret

Yingjie Fei, Zhuoran Yang, Yudong Chen and Zhaoran Wang, Qiaomin Xie

Keywords Abstract Paper

Adaptive Sampling for Stochastic Risk-Averse Learning

Sebastian Curi, Kfir Y. Levy, Stefanie Jegelka, Andreas Krause

Keywords Abstract Paper

Stochastic bandits with linear constraints

Aldo Pacchiano, Mohammad Ghavamzadeh, Peter Bartlett, Heinrich Jiang

Keywords Abstract Paper

Tightening Exploration in Upper Confidence Reinforcement Learning

Hippolyte Bourel, Odalric-Ambrym Maillard, Mohammad Sadegh Talebi

Keywords Abstract Paper

Multinomial Logit Contextual Bandits: Provable Optimality and Practicality

Min-hwan Oh, Garud Iyengar

Keywords Abstract Paper

Contextual blocking bandits

Soumya Basu, Orestis Papadigenopoulos, Constantine Caramanis, Sanjay Shakkottai

Max Simchowitz, Christopher Tosh, Akshay Krishnamurthy and
Daniel Hsu, Thodoris Lykouris, Miro Dudik, Robert E Schapire

Keywords Paper

Keywords Paper

Keywords Paper

Dongsheng Ding, Xiaohan Wei, Zhuoran Yang and
Zhaoran Wang, Mihailo Jovanovic

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Gen Li, Laixi Shi, Yuxin Chen and
Yuantao Gu, Yuejie Chi

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Qiao Tang, Hong Xie, Yunni Xia and
Jia Lee, Qingsheng Zhu

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Yingjie Fei, Zhuoran Yang, Yudong Chen and
Zhaoran Wang, Qiaomin Xie

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Tongzheng Ren, Jialian Li, Bo Dai and
Simon Du, Sujay Sanghavi

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Zhuoran Yang, Chi Jin, Zhaoran Wang and
Mengdi Wang, Michael Jordan

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper