Policy Optimization as Online Learning with Mediator Feedback

Abstract: Policy Optimization (PO) is a widely used approach to continuous control tasks. In this paper, we introduce the notion of mediator feedback, which frames PO as an online learning problem over the policy space. Compared to standard bandit feedback, the additional information available allows reusing samples generated by one policy to estimate the performance of other policies. Based on this observation, we propose an algorithm, RANDomized-exploration policy Optimization via Multiple Importance Sampling with Truncation (RANDOMIST), for regret minimization in PO. Unlike existing optimistic approaches, it employs a randomized exploration strategy. When the policy space is finite, we show that, under certain circumstances, it is possible to achieve constant regret, while always enjoying logarithmic regret. We also derive problem-dependent regret lower bounds. Then, we extend RANDOMIST to compact policy spaces. Finally, we provide numerical simulations on finite and compact policy spaces, in comparison with PO and bandit baselines.
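
The sample-reuse idea in the abstract can be illustrated with a generic truncated multiple importance sampling estimator. The sketch below is not taken from the paper. It assumes one-dimensional Gaussian policies with a shared standard deviation, a balance-heuristic mixture in the denominator, and a fixed truncation threshold. The names `gaussian_pdf`, `truncated_mis_estimate`, and `threshold` are hypothetical, and the way the paper actually sets its truncation level is not reproduced here.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian, used here as a stand-in policy density."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def truncated_mis_estimate(samples, rewards, behavior_means, counts,
                           target_mean, std, threshold):
    """Estimate the target policy's expected reward from samples drawn by
    other policies, using balance-heuristic weights clipped at `threshold`.

    Assumption: counts[k] samples were drawn from the policy with mean
    behavior_means[k]; all policies share the same standard deviation.
    """
    n = sum(counts)
    total = 0.0
    for x, r in zip(samples, rewards):
        # Mixture density of all policies that generated data so far.
        mixture = sum(c / n * gaussian_pdf(x, m, std)
                      for m, c in zip(behavior_means, counts))
        # Importance weight of the target policy, truncated to limit variance.
        weight = min(gaussian_pdf(x, target_mean, std) / mixture, threshold)
        total += weight * r
    return total / n


# Illustrative usage: two behavior policies, a target policy never played.
random.seed(0)
behavior_means, counts, std = [0.0, 1.0], [500, 500], 1.0
samples = [random.gauss(m, std)
           for m, c in zip(behavior_means, counts) for _ in range(c)]
rewards = list(samples)  # toy reward f(x) = x, so the target's true value is 0.5
estimate = truncated_mis_estimate(samples, rewards, behavior_means, counts,
                                  target_mean=0.5, std=std, threshold=10.0)
```

Truncation trades a small bias for bounded weights. In this toy setup the weights are already small, so the threshold has no effect, but it matters when the target policy places mass where the played policies rarely sample.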

19/08/2021

Deep Learning, Adversarial Networks, Applications, Fairness, Accountability, and Transparency, Theory, RL, Decisions and Control Theory

02/02/2021

Alberto Maria Metelli, Matteo Papini, Pierluca D'Oro, Marcello Restelli

Similar Papers

Neural Regret-Matching for Distributed Constraint Optimization Problems

Yanchen Deng, Runsheng Yu, Xinrun Wang, Bo An

Agent-based and Multi-agent Systems, Coordination and Cooperation, Constraint Optimization, Distributed Constraints

Tracking regret bounds for online submodular optimization

Tatsuya Matsuoka, Shinji Ito, Naoto Ohsaka

A Primal-Dual Online Algorithm for Online Matching Problem in Dynamic Environments

Yu-Hang Zhou, Peng Hu, Chen Liang, Huan Xu, Guangda Huzhang, Yinfu Feng, Qing Da, Xinshang Wang, An-Xiang Zeng

Boosting for Online Convex Optimization

Elad Hazan, Karan Singh

Theory, Online Learning Theory

Learning piecewise Lipschitz functions in changing environments

Dravyansh Sharma, Maria-Florina Balcan, Travis Dick

Delay and Cooperation in Nonstochastic Linear Bandits

Shinji Ito, Daisuke Hatano, Hanna Sumita, Kei Takemura, Takuro Fukunaga, Naonori Kakimura, Ken-Ichi Kawarabayashi

Dynamic Regret of Convex and Smooth Functions

Peng Zhao, Yu-Jie Zhang, Lijun Zhang, Zhi-Hua Zhou

Efficient Bandit Convex Optimization: Beyond Linear Losses

Arun Sai Suggala, Pradeep Ravikumar, Praneeth Netrapalli

Variational Bayesian Optimistic Sampling

Brendan O'Donoghue, Tor Lattimore

optimization, reinforcement learning and planning, generative model, bandits, online learning

Provably Correct Optimization and Exploration with Non-linear Policies

Fei Feng, Wotao Yin, Alekh Agarwal, Lin Yang

Deep Learning, Adversarial Networks, Applications, Fairness, Accountability, and Transparency, Theory, RL, Decisions and Control Theory

Projection-free Online Learning in Dynamic Environments

Yuanyu Wan, Bo Xue, Lijun Zhang

Stochastic Online Linear Regression: the Forward Algorithm to Replace Ridge

Reda Ouhamma, Odalric-Ambrym Maillard, Vianney Perchet

robustness, bandits

Optimal Algorithms for Stochastic Multi-Armed Bandits with Heavy Tailed Rewards

Kyungjae Lee, Hongjun Yang, Sungbin Lim, Songhwai Oh

Information Directed Sampling for Sparse Linear Bandits

Botao Hao, Tor Lattimore, Wei Deng

bandits

Non-Exponentially Weighted Aggregation: Regret Bounds for Unbounded Loss Functions

Pierre Alquier

Probabilistic Methods, Bayesian Methods

A Simple Approach for Non-stationary Linear Bandits

Peng Zhao, Lijun Zhang, Yuan Jiang, Zhi-Hua Zhou

Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition

Liyu Chen, Haipeng Luo, Chen-Yu Wei

Online Learning with Continuous Variations: Dynamic Regret and Reductions

Ching-An Cheng, Jonathan Lee, Ken Goldberg, Byron Boots

Optimizing Optimizers: Regret-optimal gradient descent algorithms

Philippe Casgrain, Anastasis Kratsios

Online Markov Decision Processes with Aggregate Bandit Feedback

Alon Cohen, Haim Kaplan, Tomer Koren, Yishay Mansour

Model-Free Online Learning in Unknown Sequential Decision Making Problems and Games

Gabriele Farina, Tuomas Sandholm

Root-n-Regret for Learning in Markov Decision Processes with Function Approximation and Low Bellman Rank

Kefan Dong, Jian Peng, Yining Wang, Yuan Zhou

Reinforcement learning

Adaptive Sampling for Stochastic Risk-Averse Learning

Sebastian Curi, Kfir Y. Levy, Stefanie Jegelka, Andreas Krause
