University of Southampton Institutional Repository

Post-trained language models as agents in sequential games

Dilkes, Jim
f64f01b1-79e2-4c6c-aa2f-9fd1ee430a21
Yazdanpanah, Vahid
28f82058-5e51-4f56-be14-191ab5767d56
Stein, Sebastian
cb2325e7-5e63-475e-8a69-9db2dfbdb00b

Dilkes, Jim, Yazdanpanah, Vahid and Stein, Sebastian (2025) Post-trained language models as agents in sequential games. The Third UK AI Conference 2025, The Gibbs Building, London, United Kingdom. 23-24 Jun 2025. 1 pp.

Record type: Conference or Workshop Item (Poster)

Abstract

Recent studies have found that Reinforcement Learning (RL) can endow a pre-trained Large Language Model (LLM) with improved capabilities on tasks with verifiable outcomes, removing the need for training data or explicit human feedback. This opens the door to new applications for LLMs that would previously have required a prohibitively large amount of human-generated data. In this study, we extend the Group Relative Policy Optimization (GRPO) RL algorithm for post-training LLMs on environments requiring sequential decision making. This approach allows us to integrate the innate knowledge and reasoning capabilities of LLMs into the decision-making process, thereby improving the generalization capabilities of the agent while simultaneously enhancing explainability through the model's natural language reasoning about its actions.
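As a rough sketch only (our own illustration, not the implementation presented on the poster), the core GRPO step of normalising rewards within a sampling group can be applied to whole-episode returns when the environment requires sequential decision making; the function name and shapes below are assumptions:

from statistics import mean, stdev

def group_relative_advantages(episode_returns: list[float]) -> list[float]:
    """Normalise each episode's return against the mean and spread of its sampling group."""
    mu = mean(episode_returns)
    sigma = stdev(episode_returns) if len(episode_returns) > 1 else 0.0
    if sigma == 0.0:
        return [0.0 for _ in episode_returns]  # no learning signal if all returns are identical
    return [(r - mu) / sigma for r in episode_returns]

# Example: returns from a group of Snake episodes generated from the same starting state.
print(group_relative_advantages([3.0, 5.0, 1.0, 7.0]))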

We show that by post-training an LLM of only 3 billion parameters, we can develop environment-specific decision-making capabilities comparable to those of more powerful pre-trained models. Specifically, we find that the LLM learns an appropriate strategy for reasoning about its next-best action in a multi-agent Snake game and learns to generate its responses in a prescribed format. Further, we show that this learned strategy enables the LLM to improve its performance on previously unseen variations of the Snake game. Finally, we propose a method for sampling training episodes from a larger batch of generated episodes and demonstrate that it improves both performance on the game and convergence speed.
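The record does not specify the episode-sampling criterion, so the following is only a hedged illustration of the general idea of keeping a subset of a larger generated batch for the GRPO update; the selection rule (largest-magnitude group-relative advantage) and the helper select_training_episodes are hypothetical:

def select_training_episodes(episodes: list[dict], advantages: list[float], k: int) -> list[dict]:
    """Keep the k episodes whose group-relative advantage is largest in magnitude."""
    ranked = sorted(zip(episodes, advantages), key=lambda pair: abs(pair[1]), reverse=True)
    return [episode for episode, _ in ranked[:k]]

# Example: retain the two most informative of four generated episodes.
episodes = [{"id": i} for i in range(4)]
advantages = [-0.4, 1.2, 0.1, -1.5]
print(select_training_episodes(episodes, advantages, k=2))  # keeps the episodes with id 3 and 1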

Text: Jim Dilkes - Post-Trained Language Models as Agents in Sequential Games - Version of Record (1MB)

More information

Published date: 23 June 2025
Venue - Dates: The Third UK AI Conference 2025, The Gibbs Building, London, United Kingdom, 2025-06-23 - 2025-06-24
Keywords: Large Language Models, GRPO, Reinforcement Learning

Identifiers

Local EPrints ID: 503242
URI: http://eprints.soton.ac.uk/id/eprint/503242
PURE UUID: e70edd74-63ac-4a49-b873-5d7b5a458260
ORCID for Jim Dilkes: orcid.org/0000-0002-5158-4611
ORCID for Vahid Yazdanpanah: orcid.org/0000-0002-4468-6193
ORCID for Sebastian Stein: orcid.org/0000-0003-2858-8857

Catalogue record

Date deposited: 25 Jul 2025 16:30
Last modified: 05 Aug 2025 02:12


Contributors

Author: Jim Dilkes
Author: Vahid Yazdanpanah
Author: Sebastian Stein



