A solvable attention for neural scaling laws
Lyu, Bochen; Wang, Di; Zhu, Zhanxing
April 2025
Lyu, Bochen, Wang, Di and Zhu, Zhanxing (2025) A solvable attention for neural scaling laws. In International Conference on Learning Representations (ICLR), 2025. 40 pp.
Record type: Conference or Workshop Item (Paper)
Abstract
Transformers and many other deep learning models have been empirically shown to improve their performance predictably, as a power law in training time, model size, or the number of training data points; this behaviour is termed the neural scaling law. This paper studies this intriguing phenomenon for the transformer architecture in theoretical setups. Specifically, we propose a framework in which self-attention, the underpinning block of the transformer, learns in an in-context manner, and we model the corresponding learning dynamics as a non-linear ordinary differential equation (ODE) system. We then establish a procedure for deriving a tractable solution of this ODE system by reformulating it as a Riccati equation, which allows us to precisely characterize the neural scaling laws of self-attention with respect to training time, model size, data size, and optimal compute. In addition, we show that self-attention shares similar neural scaling laws with several other architectures when the context sequence length of in-context learning is fixed; otherwise, it exhibits a different scaling law in training time.
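This record does not reproduce the paper's equations. Purely as a hedged illustration of why a Riccati reformulation can make such dynamics tractable (a standard textbook fact, not the paper's actual derivation or notation), consider a scalar Riccati equation with constant coefficients:
\[
  \dot{u}(t) \;=\; q_0 + q_1\,u(t) + q_2\,u(t)^2, \qquad q_2 \neq 0 .
\]
The substitution $u = -\dot{v}/(q_2 v)$ converts it into the linear second-order ODE $\ddot{v} - q_1 \dot{v} + q_0 q_2\, v = 0$, which admits a closed-form solution; closed-form trajectories of this kind are the sort of object that lets scaling exponents (e.g. an exponent $\alpha$ in a power-law fit $L(t) \propto t^{-\alpha}$) be read off explicitly. The symbols $u$, $v$, $q_i$, $L(t)$, and $\alpha$ here are generic placeholders, not the authors' notation.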
Text: 6212_A_Solvable_Attention_for_ - Version of Record
More information
Published date: April 2025
Venue - Dates: The Thirteenth International Conference on Learning Representations, Singapore, Singapore, 2025-04-24 - 2025-04-28
Identifiers
Local EPrints ID: 500728
URI: http://eprints.soton.ac.uk/id/eprint/500728
PURE UUID: ac43a36e-a885-417f-a221-51be76467175
Catalogue record
Date deposited: 12 May 2025 16:39
Last modified: 22 Aug 2025 02:42
Contributors
Author: Bochen Lyu
Author: Di Wang
Author: Zhanxing Zhu