University of Southampton Institutional Repository

A solvable attention for neural scaling laws

Lyu, Bochen
fb1af04c-d0d6-490b-a238-73c1dd1aed2a
Wang, Di
91366efe-c127-4a23-9787-f6ed624b15b4
Zhu, Zhanxing
e55e7385-8ba2-4a85-8bae-e00defb7d7f0

Lyu, Bochen, Wang, Di and Zhu, Zhanxing (2025) A solvable attention for neural scaling laws. In International Conference on Learning Representations (ICLR), 2025. 40 pp.

Record type: Conference or Workshop Item (Paper)

Abstract

Transformers and many other deep learning models have been empirically shown to improve their performance predictably, as a power law in training time, model size, or the number of training data points; this phenomenon is termed the neural scaling law. This paper studies this intriguing phenomenon for the transformer architecture in theoretical setups. Specifically, we propose a framework in which self-attention, the underpinning block of the transformer, learns in an in-context manner, and we model the corresponding learning dynamics as a non-linear ordinary differential equation (ODE) system. Furthermore, we establish a procedure to derive a tractable solution for this ODE system by reformulating it as a Riccati equation, which allows us to precisely characterize neural scaling laws for self-attention with respect to training time, model size, data size, and the optimal compute. In addition, we reveal that self-attention shares similar neural scaling laws with several other architectures when the context sequence length of the in-context learning is fixed; otherwise, it exhibits a different scaling law in training time.
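
As a purely illustrative sketch of why a Riccati reformulation can yield power-law scaling in training time (this toy example is not taken from the paper; the symbols r, c and r_0 are placeholders introduced here), consider the simplest scalar special case of a Riccati equation, a quadratic gradient-flow dynamics for a residual r(t):

\dot{r}(t) = -c\, r(t)^{2}, \qquad r(0) = r_{0} > 0, \quad c > 0 .

Since \frac{d}{dt}\bigl(1/r(t)\bigr) = c, the inverse of the residual grows linearly in time, giving the closed-form solution

r(t) = \frac{r_{0}}{1 + c\, r_{0}\, t} \;\sim\; (c\, t)^{-1} \quad \text{as } t \to \infty ,

i.e. a power law in training time. The ODE system studied in the paper for in-context self-attention is more involved, and its Riccati reformulation need not take this special form, but the mechanism sketched here (a quadratic nonlinearity whose inverse evolves linearly in time) illustrates how such systems can become exactly solvable and exhibit power-law behaviour.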

Text
6212_A_Solvable_Attention_for_ - Version of Record
Available under License Creative Commons Attribution.
Download (1MB)

More information

Published date: April 2025
Venue - Dates: The Thirteenth International Conference on Learning Representations, Singapore, Singapore, 2025-04-24 - 2025-04-28

Identifiers

Local EPrints ID: 500728
URI: http://eprints.soton.ac.uk/id/eprint/500728
PURE UUID: ac43a36e-a885-417f-a221-51be76467175
ORCID for Zhanxing Zhu: orcid.org/0000-0002-2141-6553

Catalogue record

Date deposited: 12 May 2025 16:39
Last modified: 22 Aug 2025 02:42

Contributors

Author: Bochen Lyu
Author: Di Wang
Author: Zhanxing Zhu

