Reinforcement learning with limited prior knowledge in long-term environments
Bossens, David (2020) Reinforcement learning with limited prior knowledge in long-term environments. University of Southampton, Doctoral Thesis, 205pp.
Record type: Thesis (Doctoral)
Abstract
Increasingly, artificial learning systems are expected to overcome complex and open-ended problems in long-term environments, where there is limited knowledge about the task to solve, the learners receive limited observations and sparse feedback, the designer has no control over the environment, and unknown tasks may be presented to the learner at random times. These features are still challenging for reinforcement learning systems, because the best learning algorithm and the best hyperparameters are not known a priori. Deep reinforcement learning methods are recommended but are limited in the number of patterns they can learn and memorise. To overcome this capacity issue, this thesis investigates long-term adaptivity as a means to improve and analyse reinforcement learning in long-term unknown environments. A first case study in non-episodic mazes with sparse rewards illustrates a novel learning type called active adaptive perception, which actively adapts how perception is used and modified based on a long-term utility function. Such learning systems are shown to construct emergent long-term strategies that avoid detracting corridors and rooms in non-episodic mazes, where a state-of-the-art deep reinforcement learning system, DRQN, gets stuck. A subsequent case study investigates lifelong learning, where reinforcement learners must solve different tasks presented in sequence. It is shown that multiple policies, each specialised on a subset of the tasks, can be used both as a source of performance improvement and as a metric for task capacity, i.e. how many tasks a single learner can learn and remember. The case study demonstrates that the DRQN learner has a low task capacity compared to an alternative deep reinforcement learning system, PPO. The results indicate that this is because PPO's slower learning allows improved long-term adaptation to different tasks. An additional finding is that adaptively learning which policy to use can be beneficial if the policies are sufficiently different from each other. On the same case study, a further result shows that, when a long-term utility function is used to evaluate performance, a correction for the different reward functions is beneficial to avoid forgetting.
Text: David Bossens PHD Fluid Structure Interactions 16may 2020 - Version of Record
More information
Published date: May 2020
Keywords: Reinforcement Learning, deep learning, deep neural networks, meta-learning, lifelong learning
Identifiers
Local EPrints ID: 442596
URI: http://eprints.soton.ac.uk/id/eprint/442596
PURE UUID: 4ed1fce1-9ac3-4533-86c2-ece74efd9559
Catalogue record
Date deposited: 20 Jul 2020 16:36
Last modified: 06 Jun 2024 01:44
Contributors
Author: David Bossens
Thesis advisors: Adam James Sobey, Nicholas Townsend