University of Southampton Institutional Repository

Post-trained language models as agents in sequential games

Dilkes, Jim
f64f01b1-79e2-4c6c-aa2f-9fd1ee430a21
Yazdanpanah, Vahid
28f82058-5e51-4f56-be14-191ab5767d56
Stein, Sebastian
cb2325e7-5e63-475e-8a69-9db2dfbdb00b

Dilkes, Jim, Yazdanpanah, Vahid and Stein, Sebastian (2025) Post-trained language models as agents in sequential games. The Third UK AI Conference 2025, The Gibbs Building, London, United Kingdom. 23-24 Jun 2025. 1 pp.

Record type: Conference or Workshop Item (Poster)

Abstract

Recent studies have found that Reinforcement Learning (RL) can endow a pre-trained Large Language Model (LLM) with improved capabilities on tasks with verifiable outcomes, removing the need for training data or explicit human feedback. This opens the door to new applications for LLMs that would previously have required a prohibitively large amount of human-generated data. In this study, we extend the Group Relative Policy Optimization (GRPO) RL algorithm for post-training LLMs on environments requiring sequential decision making. This approach allows us to integrate the innate knowledge and reasoning capabilities of LLMs into the decision-making process, thereby improving the generalization capabilities of the agent while simultaneously enhancing explainability through the model's natural language reasoning about its actions.
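As a rough sketch only (our own illustration, not the implementation presented on the poster), the core GRPO step of normalising rewards within a sampling group can be applied to whole-episode returns when the environment requires sequential decision making; the function name and shapes below are assumptions:

from statistics import mean, stdev

def group_relative_advantages(episode_returns: list[float]) -> list[float]:
    """Normalise each episode's return against the mean and spread of its sampling group."""
    mu = mean(episode_returns)
    sigma = stdev(episode_returns) if len(episode_returns) > 1 else 0.0
    if sigma == 0.0:
        return [0.0 for _ in episode_returns]  # no learning signal if all returns are identical
    return [(r - mu) / sigma for r in episode_returns]

# Example: returns from a group of Snake episodes generated from the same starting state.
print(group_relative_advantages([3.0, 5.0, 1.0, 7.0]))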

We show that by post-training an LLM of only 3 billion parameters, we can develop environment-specific decision-making capabilities comparable to those of more powerful pre-trained models. Specifically, we find that the LLM learns an appropriate strategy for reasoning about its next-best action in a multi-agent Snake game and learns to generate its responses in a prescribed format. Further, we show that this learned strategy enables the LLM to improve its performance on previously unseen variations of the Snake game. Finally, we propose a method for sampling training episodes from a larger batch of generated episodes and demonstrate that it improves both performance on the game and convergence speed.
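The record does not specify the episode-sampling criterion, so the following is only a hedged illustration of the general idea of keeping a subset of a larger generated batch for the GRPO update; the selection rule (largest-magnitude group-relative advantage) and the helper select_training_episodes are hypothetical:

def select_training_episodes(episodes: list[dict], advantages: list[float], k: int) -> list[dict]:
    """Keep the k episodes whose group-relative advantage is largest in magnitude."""
    ranked = sorted(zip(episodes, advantages), key=lambda pair: abs(pair[1]), reverse=True)
    return [episode for episode, _ in ranked[:k]]

# Example: retain the two most informative of four generated episodes.
episodes = [{"id": i} for i in range(4)]
advantages = [-0.4, 1.2, 0.1, -1.5]
print(select_training_episodes(episodes, advantages, k=2))  # keeps the episodes with id 3 and 1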

Text: Jim Dilkes - Post-Trained Language Models as Agents in Sequential Games - Version of Record (1MB)

More information

Published date: 23 June 2025
Venue - Dates: The Third UK AI Conference 2025, The Gibbs Building, London, United Kingdom, 2025-06-23 - 2025-06-24
Keywords: Large Language Models, GRPO, Reinforcement Learning

Identifiers

Local EPrints ID: 503242
URI: http://eprints.soton.ac.uk/id/eprint/503242
PURE UUID: e70edd74-63ac-4a49-b873-5d7b5a458260
ORCID for Jim Dilkes: orcid.org/0000-0002-5158-4611
ORCID for Vahid Yazdanpanah: orcid.org/0000-0002-4468-6193
ORCID for Sebastian Stein: orcid.org/0000-0003-2858-8857

Catalogue record

Date deposited: 25 Jul 2025 16:30
Last modified: 05 Aug 2025 02:12


Contributors

Author: Jim Dilkes
Author: Vahid Yazdanpanah
Author: Sebastian Stein



