The University of Southampton
University of Southampton Institutional Repository

Reinforcement learning with limited prior knowledge in long-term environments


Bossens, David (2020) Reinforcement learning with limited prior knowledge in long-term environments. University of Southampton, Doctoral Thesis, 205pp.

Record type: Thesis (Doctoral)

Abstract

Increasingly, artificial learning systems are expected to overcome complex and open-ended problems in long-term environments, where there is limited knowledge about the task to solve, the learners receive limited observations and sparse feedback, the designer has no control over the environment, and unknown tasks may be presented to the learner at random times. These features are still challenging for reinforcement learning systems, because the best learning algorithm and the best hyperparameters are not known a priori. Deep reinforcement learning methods are recommended but are limited in the number of patterns they can learn and memorise. To overcome this capacity issue, this thesis investigates long-term adaptivity to improve and analyse reinforcement learning in long-term unknown environments. A first case study in non-episodic mazes with sparse rewards illustrates a novel learning type called active adaptive perception, which actively adapts how to use and modify perception based on a long-term utility function. Such learning systems are here shown to construct emergent long-term strategies to avoid detracting corridors and rooms in non-episodic mazes, where a state-of-the-art deep reinforcement learning system, DRQN, gets stuck. A subsequent case study addresses lifelong learning, where reinforcement learners must solve different tasks presented in sequence. It is shown that multiple policies, each specialised on a subset of the tasks, can be used as a source of performance improvement as well as a metric for task capacity, that is, how many tasks a single learner can learn and remember. The case study demonstrates that the DRQN learner has low task capacity compared to an alternative deep reinforcement learning system, PPO. The results indicate that this is because PPO’s slower learning allows improved long-term adaptation to different tasks.
An additional finding is that adaptively learning which policy to use can be beneficial if the policies are sufficiently different from each other. In the same case study, an additional result shows that, when using a long-term utility function to evaluate performance, a correction for the different reward functions is beneficial to avoid forgetting.

Text
David Bossens PHD Fluid Structure Interactions 16may 2020 - Version of Record
Available under License University of Southampton Thesis Licence.

More information

Published date: May 2020
Keywords: Reinforcement Learning, deep learning, deep neural networks, meta-learning, lifelong learning

Identifiers

Local EPrints ID: 442596
URI: http://eprints.soton.ac.uk/id/eprint/442596
PURE UUID: 4ed1fce1-9ac3-4533-86c2-ece74efd9559

Catalogue record

Date deposited: 20 Jul 2020 16:36
Last modified: 22 Jul 2020 16:30

Contributors

Author: David Bossens
Thesis advisor: Adam James Sobey
Thesis advisor: Nicholas Townsend
