Post-trained language models as agents in sequential games
Dilkes, Jim, Yazdanpanah, Vahid and Stein, Sebastian (2025) Post-trained language models as agents in sequential games. The Third UK AI Conference 2025, The Gibbs Building, London, United Kingdom, 23-24 Jun 2025. 1 pp.
Record type: Conference or Workshop Item (Poster)
Abstract
Recent studies have found that Reinforcement Learning (RL) can endow a pre-trained Large Language Model (LLM) with improved capabilities on tasks with verifiable outcomes, removing the need for training data or explicit human feedback. This opens the door to new applications for LLMs that would previously have required a prohibitively large amount of human-generated data. In this study, we extend the Group Relative Policy Optimization (GRPO) RL algorithm for post-training LLMs on environments requiring sequential decision making. This approach allows us to integrate the innate knowledge and reasoning capabilities of LLMs into the decision-making process, thereby improving the generalization capabilities of the agent while simultaneously enhancing explainability through the model's natural-language reasoning about its actions.
We show that by post-training an LLM of only 3 billion parameters, we can develop environment-specific decision-making capabilities comparable to those of more powerful pre-trained models. Specifically, we find that the LLM learns an appropriate strategy for reasoning about its next-best action in a multi-agent Snake game and learns to generate its responses in a prescribed format. Further, we show that this learned strategy enables the LLM to improve its performance on previously unseen variations of the Snake game. Finally, we propose a method for sampling training episodes from a larger batch of generated episodes and demonstrate that it improves both performance on the game and convergence speed.
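For context, the core of the standard GRPO update is a group-relative advantage: for each prompt, a group of completions is sampled and each completion's reward is normalised against the mean and standard deviation of the rewards within that group. Below is a minimal illustrative sketch of this published GRPO formulation; it does not reflect the authors' sequential-games extension or their episode-sampling method, and the function name and eps parameter are our own.

import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    # Group-relative advantage: normalise each completion's reward
    # against the mean and std of all rewards sampled for the same prompt.
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: one prompt, a group of 4 sampled completions.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # approx. [ 1. -1. -1.  1.]

Because the advantage is computed from the group itself, no separate learned value function is needed, which is part of what makes GRPO attractive for post-training on tasks with verifiable outcomes.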
Text: Jim Dilkes - Post-Trained Language Models as Agents in Sequential Games - Version of Record
More information
Published date: 23 June 2025
Venue - Dates:
The Third UK AI Conference 2025, The Gibbs Building, London, United Kingdom, 2025-06-23 - 2025-06-24
Keywords:
Large Language Models, GRPO, Reinforcement Learning
Identifiers
Local EPrints ID: 503242
URI: http://eprints.soton.ac.uk/id/eprint/503242
PURE UUID: e70edd74-63ac-4a49-b873-5d7b5a458260
Catalogue record
Date deposited: 25 Jul 2025 16:30
Last modified: 05 Aug 2025 02:12
Contributors
Author: Jim Dilkes
Author: Vahid Yazdanpanah
Author: Sebastian Stein