Non-Markovian reward modelling from trajectory labels via interpretable multiple instance learning
Early, Joseph, Bewley, Tom, Evers, Christine and Ramchurn, Sarvapali (2022) Non-Markovian reward modelling from trajectory labels via interpretable multiple instance learning. arXiv. (doi:10.48550/arXiv.2205.15367)
Abstract
We generalise the problem of reward modelling (RM) for reinforcement learning (RL) to handle non-Markovian rewards. Existing work assumes that human evaluators observe each step in a trajectory independently when providing feedback on agent behaviour. In this work, we remove this assumption, extending RM to include hidden state information that captures temporal dependencies in human assessment of trajectories. We then show how RM can be approached as a multiple instance learning (MIL) problem, and develop new MIL models that are able to capture the time dependencies in labelled trajectories. We demonstrate on a range of RL tasks that our novel MIL models can reconstruct reward functions to a high level of accuracy, and that they provide interpretable learnt hidden information that can be used to train high-performing agent policies.
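To make the setting concrete, the sketch below shows one way non-Markovian reward modelling can be posed as a multiple instance learning problem: each trajectory is a bag of per-timestep instances, a recurrent hidden state carries temporal dependencies across the bag, and supervision comes only from a single trajectory-level label. This is a minimal illustrative sketch in PyTorch, not the authors' architecture; all module names, dimensions, and hyperparameters are assumptions.

```python
# Minimal sketch (illustrative, not the paper's model): an LSTM-based MIL reward
# model. A trajectory (bag) of observations is encoded with a hidden state that
# captures temporal dependencies; a per-timestep reward is predicted and summed
# to a bag-level return, which is the only quantity supervised by the label.
import torch
import torch.nn as nn

class LSTMRewardMIL(nn.Module):
    def __init__(self, obs_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.reward_head = nn.Linear(hidden_dim, 1)

    def forward(self, trajectories: torch.Tensor):
        # trajectories: (batch, timesteps, obs_dim); each bag is one trajectory
        hidden_states, _ = self.lstm(trajectories)                   # non-Markovian hidden state
        step_rewards = self.reward_head(hidden_states).squeeze(-1)   # per-instance rewards
        trajectory_return = step_rewards.sum(dim=1)                  # bag-level prediction
        return step_rewards, trajectory_return

# Training uses only trajectory-level labels (e.g. human-assigned returns).
model = LSTMRewardMIL(obs_dim=8)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
trajs = torch.randn(16, 50, 8)    # 16 labelled trajectories of 50 steps (dummy data)
labels = torch.randn(16)          # one scalar label per trajectory (dummy data)
optimiser.zero_grad()
_, predicted_returns = model(trajs)
loss = nn.functional.mse_loss(predicted_returns, labels)
loss.backward()
optimiser.step()
```

The learnt per-timestep rewards and hidden states are what would then be inspected for interpretability and used to train an agent policy, as described in the abstract.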
Text: 2205.15367v1 (Accepted Manuscript)
More information
Published date: 30 May 2022
Additional Information:
20 pages (9 main content; 2 references; 9 appendix). 11 figures (8 main content; 3 appendix)
Keywords:
cs.LG, cs.AI
Identifiers
Local EPrints ID: 458023
URI: http://eprints.soton.ac.uk/id/eprint/458023
ISSN: 2331-8422
PURE UUID: eb5bb87c-912c-410f-b45b-c099eadc93bf
Catalogue record
Date deposited: 24 Jun 2022 21:51
Last modified: 07 Jun 2024 01:57