Reinforced language models for sequential decision making
Jim Dilkes, Vahid Yazdanpanah, Sebastian Stein
14 August 2025
Abstract
Large Language Models (LLMs) show potential as sequential decision-making agents, but their application is often limited due to a reliance on large, computationally expensive models. This creates a need to improve smaller models, yet existing post-training methods are designed for single-turn interactions and cannot handle credit assignment in multi-step agentic tasks. To address this, we introduce Multi-Step Group-Relative Policy Optimization (MS-GRPO), a new algorithm for post-training LLM agents, grounded in formal Text-Mediated Stochastic Game (TSMG) and Language-Agent Policy (LAP) frameworks. For credit assignment, MS-GRPO attributes the entire cumulative episode reward to each individual episode step. We supplement this algorithm with a novel absolute-advantage-weighted episode sampling strategy that we show improves training performance. We evaluate our approach by post-training a 3-billion parameter model on Snake and Frozen Lake. Our experiments demonstrate that the method is effective in improving decision-making performance: our post-trained 3B parameter model outperforms a 72B parameter baseline by 50% on the Frozen Lake task. This work demonstrates that targeted post-training is a practical and efficient alternative to relying on model scale for creating sequential decision-making agents using LLMs.
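The abstract describes two mechanisms: attributing the entire cumulative episode reward to each individual step for credit assignment, and sampling episodes for training weighted by their absolute advantage. The following is a minimal sketch of how these two ideas might be combined, assuming group-relative normalization of total episode returns as in standard GRPO; all function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def ms_grpo_step_advantages(episode_returns):
    """Group-relative advantages: normalize each episode's total return
    against the group of episodes sampled for the same task. The resulting
    scalar is then applied to every step of that episode, reflecting the
    credit-assignment rule described in the abstract."""
    returns = np.asarray(episode_returns, dtype=float)
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)
    return adv  # one scalar per episode, shared by all of its steps

def sample_episodes_by_abs_advantage(episodes, advantages, k, rng=None):
    """Absolute-advantage-weighted episode sampling (illustrative): episodes
    whose outcomes deviate most from the group mean are more likely to be
    selected for the policy-gradient update."""
    rng = rng or np.random.default_rng()
    weights = np.abs(advantages)
    total = weights.sum()
    probs = weights / total if total > 0 else None  # fall back to uniform
    idx = rng.choice(len(episodes), size=k, replace=True, p=probs)
    return [episodes[i] for i in idx], advantages[idx]

# Illustrative usage: four episodes of the same task with different returns.
returns = [1.0, 0.0, 0.0, 1.0]
adv = ms_grpo_step_advantages(returns)
episodes = ["ep0", "ep1", "ep2", "ep3"]
picked, picked_adv = sample_episodes_by_abs_advantage(episodes, adv, k=2)
```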
Text: 2508.10839v1 (Author's Original)
More information
Published date: 14 August 2025
Keywords:
Artificial Intelligence, language models, Sequential Decision Making, Reinforcement Learning
Identifiers
Local EPrints ID: 505168
URI: http://eprints.soton.ac.uk/id/eprint/505168
PURE UUID: 0f9ea418-8408-4cf8-8f31-78f31629eece
Catalogue record
Date deposited: 01 Oct 2025 16:37
Last modified: 02 Oct 2025 02:16
Contributors
Author: Jim Dilkes
Author: Vahid Yazdanpanah
Author: Sebastian Stein