University of Southampton Institutional Repository

Non-Markovian reward modelling from trajectory labels via interpretable multiple instance learning

Early, Joseph, Bewley, Tom, Evers, Christine and Ramchurn, Sarvapali (2022) Non-Markovian reward modelling from trajectory labels via interpretable multiple instance learning. arXiv. (doi:10.48550/arXiv.2205.15367).

Record type: Article

Abstract

We generalise the problem of reward modelling (RM) for reinforcement learning (RL) to handle non-Markovian rewards. Existing work assumes that human evaluators observe each step in a trajectory independently when providing feedback on agent behaviour. In this work, we remove this assumption, extending RM to include hidden state information that captures temporal dependencies in human assessment of trajectories. We then show how RM can be approached as a multiple instance learning (MIL) problem, and develop new MIL models that are able to capture the time dependencies in labelled trajectories. We demonstrate on a range of RL tasks that our novel MIL models can reconstruct reward functions to a high level of accuracy, and that they provide interpretable learnt hidden information that can be used to train high-performing agent policies.
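
To make the MIL framing concrete, the sketch below shows one way a single trajectory-level (bag-level) return label can supervise a recurrent reward model whose per-step predictions depend on a learnt hidden state, so that rewards may depend on history rather than on the current step alone. This is a minimal illustrative sketch of the setup described in the abstract, not the authors' implementation; the names (LSTMRewardModel, reward_head) and the architecture details are assumptions.

    # Hypothetical sketch: MIL reward modelling with a recurrent hidden state.
    # Each trajectory (bag) of observations gets one return label; per-step
    # rewards are latent and never supervised directly (the MIL setting).
    import torch
    import torch.nn as nn

    class LSTMRewardModel(nn.Module):
        def __init__(self, obs_dim: int, hidden_dim: int = 64):
            super().__init__()
            # The LSTM hidden state carries the temporal (non-Markovian) context.
            self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
            # Per-instance head: maps each hidden state to a scalar reward.
            self.reward_head = nn.Linear(hidden_dim, 1)

        def forward(self, trajectories):
            # trajectories: (batch, T, obs_dim)
            hidden_states, _ = self.lstm(trajectories)                  # (batch, T, hidden_dim)
            step_rewards = self.reward_head(hidden_states).squeeze(-1)  # (batch, T)
            returns = step_rewards.sum(dim=1)                           # (batch,) bag-level prediction
            return step_rewards, returns

    # Toy usage: fit predicted returns to trajectory-level labels only.
    model = LSTMRewardModel(obs_dim=4)
    optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
    trajs = torch.randn(8, 20, 4)   # 8 trajectories of 20 steps each
    labels = torch.randn(8)         # one return label per trajectory
    for _ in range(200):
        _, pred_returns = model(trajs)
        loss = nn.functional.mse_loss(pred_returns, labels)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()

Because only the summed return is supervised, the per-step outputs and the recurrent hidden state are free to encode temporal dependencies in the labelling process; inspecting that learnt hidden information is what gives the interpretability described in the abstract.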

Text
2205.15367v1 - Accepted Manuscript
Available under License Creative Commons Attribution.
Download (4MB)

More information

Published date: 30 May 2022
Additional Information: 20 pages (9 main content; 2 references; 9 appendix). 11 figures (8 main content; 3 appendix)
Keywords: cs.LG, cs.AI

Identifiers

Local EPrints ID: 458023
URI: http://eprints.soton.ac.uk/id/eprint/458023
ISSN: 2331-8422
PURE UUID: eb5bb87c-912c-410f-b45b-c099eadc93bf
ORCID for Christine Evers: orcid.org/0000-0003-0757-5504
ORCID for Sarvapali Ramchurn: orcid.org/0000-0001-9686-4302

Catalogue record

Date deposited: 24 Jun 2022 21:51
Last modified: 17 Mar 2024 04:01

Contributors

Author: Joseph Early
Author: Tom Bewley
Author: Christine Evers
Author: Sarvapali Ramchurn
