A solvable attention for neural scaling laws
Lyu, Bochen; Wang, Di; Zhu, Zhanxing
April 2025
Lyu, Bochen, Wang, Di and Zhu, Zhanxing (2025) A solvable attention for neural scaling laws. In International Conference on Learning Representations (ICLR), 2025. 40 pp.
Record type: Conference or Workshop Item (Paper)
Abstract
Transformers and many other deep learning models have been empirically shown to improve their performance predictably, as a power law in training time, model size, or the number of training data points; this behaviour is termed the neural scaling law. This paper studies this intriguing phenomenon for the transformer architecture in theoretical setups. Specifically, we propose a framework in which self-attention, the underpinning block of the transformer, learns in an in-context manner, and we model the corresponding learning dynamics as a non-linear ordinary differential equation (ODE) system. We then establish a procedure for deriving a tractable solution of this ODE system by reformulating it as a Riccati equation, which allows us to precisely characterize the neural scaling laws of self-attention with respect to training time, model size, data size, and optimal compute. In addition, we show that self-attention shares similar neural scaling laws with several other architectures when the context sequence length of in-context learning is fixed; otherwise, it exhibits a different scaling law in training time.
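This record does not reproduce the paper's equations. Purely as a hedged illustration of why a Riccati reformulation can make such dynamics tractable (a standard textbook fact, not the paper's actual derivation or notation), consider a scalar Riccati equation with constant coefficients:
\[
  \dot{u}(t) \;=\; q_0 + q_1\,u(t) + q_2\,u(t)^2, \qquad q_2 \neq 0 .
\]
The substitution $u = -\dot{v}/(q_2 v)$ converts it into the linear second-order ODE $\ddot{v} - q_1 \dot{v} + q_0 q_2\, v = 0$, which admits a closed-form solution; closed-form trajectories of this kind are the sort of object that lets scaling exponents (e.g. an exponent $\alpha$ in a power-law fit $L(t) \propto t^{-\alpha}$) be read off explicitly. The symbols $u$, $v$, $q_i$, $L(t)$, and $\alpha$ here are generic placeholders, not the authors' notation.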
Text: 6212_A_Solvable_Attention_for_ - Version of Record
More information
Published date: April 2025
Venue - Dates: The Thirteenth International Conference on Learning Representations, Singapore, Singapore, 2025-04-24 - 2025-04-28
Identifiers
Local EPrints ID: 500728
URI: http://eprints.soton.ac.uk/id/eprint/500728
PURE UUID: ac43a36e-a885-417f-a221-51be76467175
Catalogue record
Date deposited: 12 May 2025 16:39
Last modified: 22 Aug 2025 02:42
Contributors
Author: Bochen Lyu
Author: Di Wang
Author: Zhanxing Zhu