University of Southampton Institutional Repository

Variational delayed policy optimization


Wu, Qingyuan, Zhan, Simon Sinong, Wang, Yixuan, Wang, Yuhui, Lin, Chung-Wei, Lv, Chen, Zhu, Qi and Huang, Chao (2024) Variational delayed policy optimization. Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J. and Zhang, C. (eds.) In Advances in Neural Information Processing Systems 37 (NeurIPS 2024).

Record type: Conference or Workshop Item (Paper)

Abstract

In environments with delayed observations, the state is commonly augmented with the actions taken within the delay window to restore the Markov property and enable reinforcement learning (RL). However, state-of-the-art (SOTA) RL techniques based on temporal-difference (TD) learning often learn inefficiently, because the augmented state space expands significantly as the delay grows. To improve learning efficiency without sacrificing performance, this work introduces Variational Delayed Policy Optimization (VDPO), which reformulates delayed RL as a variational inference problem. This problem is further modelled as a two-step iterative optimization problem: the first step is TD learning in the delay-free environment with a small state space, and the second step is behaviour cloning, which can be addressed much more efficiently than TD learning. We provide a theoretical analysis of VDPO in terms of sample complexity and performance, and empirically demonstrate on the MuJoCo benchmark that VDPO achieves performance consistent with SOTA methods while significantly improving sample efficiency (approximately 50% fewer samples).
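
The state augmentation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not reproduce VDPO itself. The toy additive dynamics (`toy_step`), the `DelayedEnv` class, the no-op padding at the start of an episode, and the helper names are all assumptions made for illustration. The sketch shows two things: the augmented state (the most recent observable state plus the actions taken since) is enough to recover the current state under deterministic dynamics, and for finite spaces the augmented state space has |S| * |A|^d elements, so it grows exponentially with the delay d.

```python
from collections import deque
from typing import Callable, Tuple

State = int
Action = int
Augmented = Tuple[State, Tuple[Action, ...]]


def toy_step(state: State, action: Action) -> State:
    """Hypothetical deterministic dynamics: the action is added to the state."""
    return state + action


class DelayedEnv:
    """Toy environment whose state is observed `delay` steps late (illustrative only)."""

    def __init__(self, step_fn: Callable[[State, Action], State],
                 initial_state: State, delay: int, noop: Action = 0) -> None:
        if delay < 0:
            raise ValueError("delay must be non-negative")
        self.step_fn = step_fn
        # Keep the last delay + 1 true states; the oldest one is what the agent sees.
        self.states = deque([initial_state], maxlen=delay + 1)
        # Keep the last `delay` actions, padded with no-ops (an assumed convention).
        self.actions = deque([noop] * delay, maxlen=delay)

    def augmented_state(self) -> Augmented:
        """Return (delayed state, actions taken within the delay window)."""
        return self.states[0], tuple(self.actions)

    def true_state(self) -> State:
        """Return the current state, which the agent cannot observe directly."""
        return self.states[-1]

    def step(self, action: Action) -> Augmented:
        self.states.append(self.step_fn(self.states[-1], action))
        self.actions.append(action)
        return self.augmented_state()


def roll_forward(step_fn: Callable[[State, Action], State], augmented: Augmented) -> State:
    """Recover the current state by replaying the buffered actions."""
    state, actions = augmented
    for action in actions:
        state = step_fn(state, action)
    return state


def augmented_space_size(n_states: int, n_actions: int, delay: int) -> int:
    """Size of the augmented state space S x A^delay for finite S and A."""
    return n_states * n_actions ** delay


if __name__ == "__main__":
    env = DelayedEnv(toy_step, initial_state=0, delay=3)
    for a in [1, 2, 3, 4, 5]:
        x = env.step(a)
    print("augmented state:", x)
    print("recovered state:", roll_forward(toy_step, x), "true state:", env.true_state())
    print("augmented space size:", augmented_space_size(10, 4, 3))
```

With a delay of 0 the action buffer is empty and the augmented state reduces to the ordinary state, which corresponds to the delay-free setting with a small state space that the abstract refers to.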

This record has no associated files available for download.

More information

Published date: 10 December 2024

Identifiers

Local EPrints ID: 500954
URI: http://eprints.soton.ac.uk/id/eprint/500954
PURE UUID: e1c06fc5-ad24-479b-b96c-7c20d6e1f431
ORCID for Chao Huang: orcid.org/0000-0002-9300-1787

Catalogue record

Date deposited: 19 May 2025 17:10
Last modified: 20 May 2025 02:14

Contributors

Author: Qingyuan Wu
Author: Simon Sinong Zhan
Author: Yixuan Wang
Author: Yuhui Wang
Author: Chung-Wei Lin
Author: Chen Lv
Author: Qi Zhu
Author: Chao Huang
Editor: A. Globerson
Editor: L. Mackey
Editor: D. Belgrave
Editor: A. Fan
Editor: U. Paquet
Editor: J. Tomczak
Editor: C. Zhang
