Reinforcement learning with limited prior knowledge in long-term environments
Bossens, David (2020) Reinforcement learning with limited prior knowledge in long-term environments. University of Southampton, Doctoral Thesis, 205pp.
Record type: Thesis (Doctoral)
Abstract
Increasingly, artificial learning systems are expected to overcome complex and open-ended problems in long-term environments, where there is limited knowledge about the task to solve, the learners receive limited observations and sparse feedback, the designer has no control over the environment, and unknown tasks may be presented to the learner at random times. These features are still challenging for reinforcement learning systems, because the best learning algorithm and the best hyperparameters are not known a priori. Deep reinforcement learning methods are recommended but are limited in the number of patterns they can learn and memorise. To overcome this capacity issue, this thesis investigates long-term adaptivity as a means to improve and analyse reinforcement learning in long-term unknown environments. A first case study in non-episodic mazes with sparse rewards illustrates a novel learning type called active adaptive perception, which actively adapts how perception is used and modified based on a long-term utility function. Such learning systems are shown to construct emergent long-term strategies that avoid detracting corridors and rooms in non-episodic mazes, where a state-of-the-art deep reinforcement learning system, DRQN, gets stuck. A subsequent case study investigates lifelong learning, where reinforcement learners must solve different tasks presented in sequence. It is shown that multiple policies, each specialised on a subset of the tasks, can be used both as a source of performance improvement and as a metric for task capacity, i.e. how many tasks a single learner can learn and remember. The case study demonstrates that the DRQN learner has a low task capacity compared to an alternative deep reinforcement learning system, PPO. The results indicate that this is because PPO's slower learning allows improved long-term adaptation to different tasks. An additional finding is that adaptively learning which policy to use can be beneficial if the policies are sufficiently different from each other. On the same case study, a further result shows that, when a long-term utility function is used to evaluate performance, a correction for the different reward functions is beneficial to avoid forgetting.
Text: David Bossens PHD Fluid Structure Interactions 16may 2020 - Version of Record
More information
Published date: May 2020
Keywords: Reinforcement Learning, deep learning, deep neural networks, meta-learning, lifelong learning
Identifiers
Local EPrints ID: 442596
URI: http://eprints.soton.ac.uk/id/eprint/442596
PURE UUID: 4ed1fce1-9ac3-4533-86c2-ece74efd9559
Catalogue record
Date deposited: 20 Jul 2020 16:36
Last modified: 06 Jun 2024 01:44
Contributors
Author: David Bossens
Thesis advisors: Adam James Sobey, Nicholas Townsend