Walking the values in Bayesian inverse reinforcement learning
Bajgar, O., Abate, A., Gatsis, K. and Osborne, M.A.
15 July 2024
Abstract
The goal of Bayesian inverse reinforcement learning (IRL) is to recover a posterior distribution over reward functions from a set of demonstrations by an expert optimizing for a reward unknown to the learner. The resulting posterior over rewards can then be used to synthesize an apprentice policy that performs well on the same or a similar task. A key challenge in Bayesian IRL is bridging the computational gap between the hypothesis space of possible rewards and the likelihood, which is often defined in terms of Q-values: vanilla Bayesian IRL needs to solve the costly forward planning problem - going from rewards to Q-values - at every step of the algorithm, which may need to be done thousands of times. We propose to solve this with a simple change: instead of primarily sampling in the space of rewards, we work primarily in the space of Q-values, since the computation required to go from Q-values to rewards is radically cheaper. Furthermore, this reversal of the computation makes it easy to compute the gradient, allowing efficient sampling using Hamiltonian Monte Carlo. We propose ValueWalk - a new Markov chain Monte Carlo method based on this insight - and illustrate its advantages on several tasks.
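The cheap Q-to-reward direction highlighted in the abstract is essentially the Bellman equation read in reverse. The tabular sketch below is an illustration of that asymmetry rather than code from the paper: it assumes a greedy-optimal Bellman relation (the paper's exact likelihood and value construction may differ), and the MDP sizes, random transition kernel P and discount gamma are made up for the example.

# Minimal sketch of the computational asymmetry described in the abstract,
# assuming a greedy-optimal Bellman relation (an assumption for illustration;
# the paper's exact construction may differ). The MDP below is a toy example.
import numpy as np

n_states, n_actions, gamma = 5, 3, 0.9
rng = np.random.default_rng(0)

# Random transition kernel P[s, a, s'] for the toy MDP.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)

def q_to_reward(Q, P, gamma):
    # Cheap direction: read the Bellman equation in reverse,
    # r(s, a) = Q(s, a) - gamma * sum_s' P(s'|s, a) * max_a' Q(s', a').
    V = Q.max(axis=1)          # greedy state values, shape (n_states,)
    return Q - gamma * P @ V   # one matrix-vector product per (s, a)

def reward_to_q(R, P, gamma, n_iter=500):
    # Costly direction: forward planning by value iteration.
    Q = np.zeros_like(R)
    for _ in range(n_iter):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

# Round trip: pick arbitrary Q-values, read off the implied reward, and check
# that planning with that reward reproduces the original Q-values.
Q_sample = rng.normal(size=(n_states, n_actions))
R_implied = q_to_reward(Q_sample, P, gamma)
Q_check = reward_to_q(R_implied, P, gamma)
print(np.max(np.abs(Q_check - Q_sample)))  # close to zero

Because the Q-to-reward map is a short differentiable expression, parameterising the sampler by Q-values makes gradients of the log-posterior cheap to obtain, which is what makes the Hamiltonian Monte Carlo sampling described in the abstract practical.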
Text: 2407.10971v1 (Author's Original)
More information
Published date: 15 July 2024
Identifiers
Local EPrints ID: 494529
URI: http://eprints.soton.ac.uk/id/eprint/494529
PURE UUID: 2f88099b-9ec7-46a8-901f-2390119b1bbb
Catalogue record
Date deposited: 10 Oct 2024 16:34
Last modified: 11 Oct 2024 02:08