University of Southampton Institutional Repository

Reinforced language models for sequential decision making

Dilkes, Jim; Yazdanpanah, Vahid; Stein, Sebastian

Record type: UNSPECIFIED

Abstract

Large Language Models (LLMs) show potential as sequential decision-making agents, but their application is often limited due to a reliance on large, computationally expensive models. This creates a need to improve smaller models, yet existing post-training methods are designed for single-turn interactions and cannot handle credit assignment in multi-step agentic tasks. To address this, we introduce Multi-Step Group-Relative Policy Optimization (MS-GRPO), a new algorithm for post-training LLM agents, grounded in formal Text-Mediated Stochastic Game (TSMG) and Language-Agent Policy (LAP) frameworks. For credit assignment, MS-GRPO attributes the entire cumulative episode reward to each individual episode step. We supplement this algorithm with a novel absolute-advantage-weighted episode sampling strategy that we show improves training performance. We evaluate our approach by post-training a 3-billion parameter model on Snake and Frozen Lake. Our experiments demonstrate that the method is effective in improving decision-making performance: our post-trained 3B parameter model outperforms a 72B parameter baseline by 50% on the Frozen Lake task. This work demonstrates that targeted post-training is a practical and efficient alternative to relying on model scale for creating sequential decision-making agents using LLMs.
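
A minimal Python sketch of the two mechanisms the abstract describes may help make them concrete. Under the stated assumptions, a GRPO-style group-relative advantage is computed from each episode's cumulative reward and broadcast unchanged to every step of that episode (the MS-GRPO credit assignment), and episodes are then drawn for the policy update with probability proportional to the absolute value of that advantage (the absolute-advantage-weighted sampling). All names here (Episode, assign_group_relative_advantages, sample_episodes_by_abs_advantage) are hypothetical illustrations rather than the paper's implementation, and the exact normalisation and sampling details may differ.

    import random
    from dataclasses import dataclass
    from statistics import mean, pstdev

    @dataclass
    class Episode:
        steps: list             # per-step (state text, action text) records
        total_reward: float     # cumulative reward over the whole episode
        advantage: float = 0.0  # group-relative advantage, filled in below

    def assign_group_relative_advantages(group, eps=1e-8):
        """Normalise cumulative episode rewards within the sampled group,
        then broadcast each episode's advantage to every one of its steps
        (the credit-assignment scheme described in the abstract)."""
        returns = [ep.total_reward for ep in group]
        mu, sigma = mean(returns), pstdev(returns)
        for ep in group:
            ep.advantage = (ep.total_reward - mu) / (sigma + eps)
        # every step inherits the whole-episode advantage
        return [(step, ep.advantage) for ep in group for step in ep.steps]

    def sample_episodes_by_abs_advantage(group, k, rng=random):
        """Absolute-advantage-weighted sampling: episodes whose return
        deviates most from the group mean, in either direction, are
        preferentially selected for training."""
        weights = [abs(ep.advantage) + 1e-8 for ep in group]
        return rng.choices(group, weights=weights, k=k)

    # Example: four sampled episodes in one group
    group = [Episode(steps=["s1", "s2"], total_reward=r)
             for r in (0.0, 1.0, 0.5, 1.0)]
    step_targets = assign_group_relative_advantages(group)
    training_batch = sample_episodes_by_abs_advantage(group, k=2)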

Text: 2508.10839v1 - Author's Original (1MB), available under a Creative Commons Attribution License.

More information

Published date: 14 August 2025
Keywords: Artificial Intelligence, language models, Sequential Decision Making, Reinforcement Learning

Identifiers

Local EPrints ID: 505168
URI: http://eprints.soton.ac.uk/id/eprint/505168
PURE UUID: 0f9ea418-8408-4cf8-8f31-78f31629eece
ORCID for Jim Dilkes: orcid.org/0000-0002-5158-4611
ORCID for Vahid Yazdanpanah: orcid.org/0000-0002-4468-6193
ORCID for Sebastian Stein: orcid.org/0000-0003-2858-8857

Catalogue record

Date deposited: 01 Oct 2025 16:37
Last modified: 02 Oct 2025 02:16

Contributors

Author: Jim Dilkes
Author: Vahid Yazdanpanah
Author: Sebastian Stein

