Reinforced language models for sequential decision making
Jim Dilkes, Vahid Yazdanpanah, Sebastian Stein
14 August 2025
Abstract
Large Language Models (LLMs) show potential as sequential decision-making agents, but their application is often limited due to a reliance on large, computationally expensive models. This creates a need to improve smaller models, yet existing post-training methods are designed for single-turn interactions and cannot handle credit assignment in multi-step agentic tasks. To address this, we introduce Multi-Step Group-Relative Policy Optimization (MS-GRPO), a new algorithm for post-training LLM agents, grounded in formal Text-Mediated Stochastic Game (TSMG) and Language-Agent Policy (LAP) frameworks. For credit assignment, MS-GRPO attributes the entire cumulative episode reward to each individual episode step. We supplement this algorithm with a novel absolute-advantage-weighted episode sampling strategy that we show improves training performance. We evaluate our approach by post-training a 3-billion parameter model on Snake and Frozen Lake. Our experiments demonstrate that the method is effective in improving decision-making performance: our post-trained 3B parameter model outperforms a 72B parameter baseline by 50% on the Frozen Lake task. This work demonstrates that targeted post-training is a practical and efficient alternative to relying on model scale for creating sequential decision-making agents using LLMs.
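The abstract describes two mechanisms: attributing the entire cumulative episode reward to each individual step for credit assignment, and sampling episodes for training weighted by their absolute advantage. The following is a minimal sketch of how these two ideas might be combined, assuming group-relative normalization of total episode returns as in standard GRPO; all function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def ms_grpo_step_advantages(episode_returns):
    """Group-relative advantages: normalize each episode's total return
    against the group of episodes sampled for the same task. The resulting
    scalar is then applied to every step of that episode, reflecting the
    credit-assignment rule described in the abstract."""
    returns = np.asarray(episode_returns, dtype=float)
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)
    return adv  # one scalar per episode, shared by all of its steps

def sample_episodes_by_abs_advantage(episodes, advantages, k, rng=None):
    """Absolute-advantage-weighted episode sampling (illustrative): episodes
    whose outcomes deviate most from the group mean are more likely to be
    selected for the policy-gradient update."""
    rng = rng or np.random.default_rng()
    weights = np.abs(advantages)
    total = weights.sum()
    probs = weights / total if total > 0 else None  # fall back to uniform
    idx = rng.choice(len(episodes), size=k, replace=True, p=probs)
    return [episodes[i] for i in idx], advantages[idx]

# Illustrative usage: four episodes of the same task with different returns.
returns = [1.0, 0.0, 0.0, 1.0]
adv = ms_grpo_step_advantages(returns)
episodes = ["ep0", "ep1", "ep2", "ep3"]
picked, picked_adv = sample_episodes_by_abs_advantage(episodes, adv, k=2)
```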
Text: 2508.10839v1 (Author's Original)
More information
Published date: 14 August 2025
Keywords:
Artificial Intelligence, language models, Sequential Decision Making, Reinforcement Learning
Identifiers
Local EPrints ID: 505168
URI: http://eprints.soton.ac.uk/id/eprint/505168
PURE UUID: 0f9ea418-8408-4cf8-8f31-78f31629eece
Catalogue record
Date deposited: 01 Oct 2025 16:37
Last modified: 02 Oct 2025 02:16
Contributors
Author: Jim Dilkes
Author: Vahid Yazdanpanah
Author: Sebastian Stein