Explaining the future context of deep reinforcement learning agents’ decision-making
Towers, Mark (2025) Explaining the future context of deep reinforcement learning agents’ decision-making. University of Southampton, Doctoral Thesis, 180pp.
Record type: Thesis (Doctoral)
Abstract
Deep reinforcement learning has achieved superhuman performance in numerous environments. Despite these advances, there are limited tools for understanding why agents make decisions. A central issue is how specific actions enable agents to collect rewards or achieve goals far in the future. Understanding this future context of an agent's decision-making is critical to explaining its choices. To date, however, little research has explored such temporal explanations. Therefore, we investigate how to explain the future context of agents’ decision-making for both pretrained agents, using a memory of past behaviour, and architecturally modified agents, which explicitly output their next $N$ expected rewards. We evaluate these explanations with user surveys, finding them preferred over and more effective than baseline algorithms in Atari environments.
We develop three novel video-based explanations for pretrained agents. Two of these require no domain knowledge, in common with prior work, while the third incorporates limited domain knowledge. These approaches are the first local explanations that use a memory of how an agent acted in the past to explain its current decision-making. We collect similar decisions from past states or skills and showcase them to users to help visualise an action’s possible future outcomes.
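A minimal sketch of how such a memory-based explanation might be assembled (the data layout, the Euclidean similarity measure, and the fixed-length clip are illustrative assumptions, not the thesis's implementation):

    import numpy as np

    def explain_with_memory(current_state, chosen_action, memory,
                            n_examples=3, clip_length=30):
        """Retrieve past decisions similar to the current one and return short
        clips of what followed, to visualise the action's possible outcomes."""
        # memory: list of dicts with 'state' (feature vector), 'action', and
        # 'frames' (the observations recorded after the decision was taken)
        candidates = [m for m in memory if m['action'] == chosen_action]
        # rank stored decisions by how close their state features are to the current state
        distances = [np.linalg.norm(m['state'] - current_state) for m in candidates]
        nearest = np.argsort(distances)[:n_examples]
        # each explanation is a short clip of the agent's subsequent behaviour
        return [candidates[i]['frames'][:clip_length] for i in nearest]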
We identify that deep reinforcement learning agents implicitly compute their beliefs about the future when predicting their rewards (i.e., Q-values or state-values). From this, we prove that an agent's Q-value can be transformed to compute the expected reward at each future timestep. This opens up the opportunity to explain an agent's confidence and decision-making at individual future timesteps. This innovation allows us to propose a novel training algorithm, referred to as Temporal Reward Decomposition, in which agents output their expected rewards for the next $N$ timesteps. From this, we pioneer three novel explanations for users with a strong understanding of reinforcement learning. For non-technical users, we propose a fourth explanation that uses Large Language Models to summarise the future rewards in natural language.
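A minimal sketch of the underlying identity (the discount factor $\gamma$ and the split at timestep $N$ are the only assumptions here; the thesis's exact formulation may differ):

$$Q(s_t, a_t) \;=\; \mathbb{E}\!\left[\sum_{i=0}^{\infty} \gamma^{i} r_{t+i}\right] \;=\; \sum_{i=0}^{N-1} \gamma^{i}\,\mathbb{E}[r_{t+i}] \;+\; \gamma^{N}\,\mathbb{E}\!\left[\sum_{i=N}^{\infty} \gamma^{i-N} r_{t+i}\right]$$

Under this view, a Temporal Reward Decomposition agent can output the first $N$ expected rewards individually, together with a residual term, and their discounted sum recovers the original Q-value.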
We conduct two user surveys to evaluate our temporal explanations against two baseline algorithms. In the second, we propose a novel evaluation methodology inspired by debugging, in which users must identify an unknown agent's goal from an explanation of its decision-making. We find that, in both user surveys, our temporal explanations were preferred and, in the second, were significantly more effective for determining an agent's goal.
Text: archival phd_thesis - Version of Record
Text: Final-thesis-submission-Examination-Mr-Mark-Towers - Restricted to Repository staff only
More information
Published date: 2025
Keywords: Explainable Reinforcement Learning
Identifiers
Local EPrints ID: 502074
URI: http://eprints.soton.ac.uk/id/eprint/502074
PURE UUID: d0f52859-23c5-4527-a42b-de0d9fc5b223
Catalogue record
Date deposited: 16 Jun 2025 16:38
Last modified: 11 Sep 2025 03:18
Contributors
Author: Mark Towers
Thesis advisor: Yali Du
Thesis advisor: Chris Freeman