University of Southampton Institutional Repository

Explaining the future context of deep reinforcement learning agents’ decision-making

Towers, Mark (2025) Explaining the future context of deep reinforcement learning agents’ decision-making. University of Southampton, Doctoral Thesis, 180pp.

Record type: Thesis (Doctoral)

Abstract

Deep reinforcement learning has achieved superhuman performance in numerous environments. Despite these advances, there are limited tools for understanding why agents make the decisions they do. A central issue is how specific actions enable agents to collect rewards or achieve goals far in the future. Understanding this future context of an agent's decision-making is critical to explaining its choices. To date, however, little research has explored such temporal explanations. We therefore investigate how to explain the future context of agents' decision-making, both for pretrained agents, using a memory of past behaviour, and for architecturally modified agents that explicitly output their next $N$ expected rewards. We evaluate these explanations with user surveys in Atari environments, finding them preferred over, and more effective than, baseline algorithms.

We develop three novel video-based explanations for pretrained agents. Two of these require no domain knowledge, as is common in prior work, while the third incorporates limited domain knowledge. These approaches are the first local explanations to use a memory of how an agent acted in the past to explain its current decision-making. We collect similar decisions from past states or skills and showcase them to users to help visualise an action's possible future outcomes.
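
The abstract does not describe the retrieval mechanism, so the following Python sketch is only illustrative of the general idea: storing past decisions alongside pointers into recorded episodes, then retrieving the most similar past states in which the same action was chosen. The DecisionMemory class, the embedding-based similarity, and the (episode_id, timestep) clip references are assumptions made for illustration, not the author's implementation.

    import numpy as np

    class DecisionMemory:
        """Hypothetical memory of past agent decisions.

        Each entry stores a state embedding, the action taken, and a pointer
        (episode_id, timestep) into recorded gameplay, so a short video clip
        of what happened next can be replayed to the user.
        """

        def __init__(self):
            self.embeddings = []   # 1-D numpy arrays summarising past states
            self.actions = []      # action chosen in each past state
            self.clip_refs = []    # (episode_id, timestep) into stored videos

        def add(self, embedding, action, episode_id, timestep):
            self.embeddings.append(np.asarray(embedding, dtype=np.float32))
            self.actions.append(action)
            self.clip_refs.append((episode_id, timestep))

        def similar_decisions(self, embedding, action, k=3):
            """Return clip references for the k most similar past states
            in which the agent chose the same action."""
            query = np.asarray(embedding, dtype=np.float32)
            candidates = [i for i, a in enumerate(self.actions) if a == action]
            if not candidates:
                return []
            dists = [np.linalg.norm(self.embeddings[i] - query) for i in candidates]
            nearest = np.argsort(dists)[:k]
            return [self.clip_refs[candidates[i]] for i in nearest]

The returned clip references would then be rendered as short videos of how similar past decisions played out, giving users a visual sense of the action's possible future outcomes.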

We identify that deep reinforcement learning agents implicitly compute their beliefs about the future when predicting their rewards (i.e., Q-value or State-value). From this, we prove that an agent's Q-value can be transformed into computing the expected reward for each future timestep. This opens up the opportunity to explain an agent's confidence and decision-making for individual future timesteps. This innovation allows us to propose a novel training algorithm referred to as Temporal Reward Decomposition, where agents output their expected rewards for the next N timesteps. From this, we pioneer three novel explanations for users with a strong understanding of reinforcement learning. For non-technical users, we propose a fourth explanation using Large Language Models to summarise the future rewards in natural language.
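
As a sketch of the kind of decomposition described (using standard reinforcement learning notation not given in the abstract: discount factor $\gamma$ and reward $r_{t+i}$ received $i$ steps after time $t$), the Q-value is an expectation over a discounted sum of future rewards, and by linearity of expectation it splits into per-timestep expected rewards plus a tail term. The exact form used by Temporal Reward Decomposition, including how rewards beyond the $N$-th timestep are handled, is defined in the thesis itself:

    Q^{\pi}(s_t, a_t)
      = \mathbb{E}_{\pi}\!\left[ \sum_{i=0}^{\infty} \gamma^{i} r_{t+i} \,\middle|\, s_t, a_t \right]
      = \sum_{i=0}^{N-1} \gamma^{i}\, \mathbb{E}_{\pi}\!\left[ r_{t+i} \mid s_t, a_t \right]
        \;+\; \gamma^{N}\, \mathbb{E}_{\pi}\!\left[ \sum_{i=N}^{\infty} \gamma^{i-N} r_{t+i} \,\middle|\, s_t, a_t \right].

An agent trained in this way would output the $N$ per-timestep terms $\mathbb{E}_{\pi}[r_{t+i} \mid s_t, a_t]$ individually rather than only their sum, which is what enables explanations of expected reward and confidence at individual future timesteps.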

We conduct two user surveys to evaluate our temporal explanations against two baseline algorithms. In the second, we propose a novel evaluation methodology inspired by debugging, where users must identify an unknown agent's goal from an explanation of its decision-making. We find that in both user surveys, our temporal explanations were preferred and, in the second, were significantly more effective for determining an agent's goal.

Text: archival phd_thesis (Version of Record)
Available under License University of Southampton Thesis Licence.
Download (9MB)

Text: Final-thesis-submission-Examination-Mr-Mark-Towers
Restricted to Repository staff only

More information

Published date: 2025
Keywords: Explainable Reinforcement Learning

Identifiers

Local EPrints ID: 502074
URI: http://eprints.soton.ac.uk/id/eprint/502074
PURE UUID: d0f52859-23c5-4527-a42b-de0d9fc5b223
ORCID for Mark Towers: orcid.org/0000-0002-2609-2041
ORCID for Tim Norman: orcid.org/0000-0002-6387-4034
ORCID for Chris Freeman: orcid.org/0000-0003-0305-9246

Catalogue record

Date deposited: 16 Jun 2025 16:38
Last modified: 11 Sep 2025 03:18

Contributors

Author: Mark Towers
Thesis advisor: Tim Norman
Thesis advisor: Yali Du
Thesis advisor: Chris Freeman
