YuriiFormer: A Suite of Nesterov-Accelerated Transformers
About
We propose a variational framework that interprets transformer layers as iterations of an optimization algorithm acting on token embeddings. In this view, self-attention implements a gradient step of an interaction energy, while MLP layers correspond to gradient updates of a potential energy. Standard GPT-style transformers emerge as vanilla gradient descent on the resulting composite objective, implemented via Lie-Trotter splitting between these two energy functionals. This perspective enables principled architectural design using classical optimization ideas. As a proof of concept, we introduce a Nesterov-style accelerated transformer that preserves the same attention and MLP oracles. The resulting architecture consistently outperforms a nanoGPT baseline on TinyStories and OpenWebText, demonstrating that optimization-theoretic insights can translate into practical gains.
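The contrast between the two update schemes can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the paper's implementation: the function names (`attention_oracle`, `mlp_oracle`, `gd_layer`, `nesterov_layer`) and the `step` and `momentum` parameters are hypothetical, the attention has no learned projections, and layer normalization and multiple heads are omitted. The momentum rule shown is the textbook Nesterov look-ahead form, which may differ from the exact update used in the paper.

```python
import numpy as np

def attention_oracle(x):
    """Toy single-head self-attention (no learned projections).
    Stands in for the update direction from the interaction energy."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

def mlp_oracle(x, w1, w2):
    """Toy two-layer MLP applied per token.
    Stands in for the update direction from the potential energy."""
    return np.tanh(x @ w1) @ w2

def gd_layer(x, w1, w2, step=1.0):
    """Plain residual layer: one attention sub-step, then one MLP sub-step
    (Lie-Trotter splitting of the two updates)."""
    x = x + step * attention_oracle(x)
    x = x + step * mlp_oracle(x, w1, w2)
    return x

def nesterov_layer(x, v, w1, w2, step=1.0, momentum=0.9):
    """Assumed Nesterov-style layer: each oracle is evaluated at a
    look-ahead point and accumulated into a velocity state v."""
    y = x + momentum * v                       # look-ahead for attention
    v = momentum * v + step * attention_oracle(y)
    x = x + v
    y = x + momentum * v                       # look-ahead for MLP
    v = momentum * v + step * mlp_oracle(y, w1, w2)
    x = x + v
    return x, v

# Example: 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
w_in = 0.1 * rng.normal(size=(8, 16))
w_out = 0.1 * rng.normal(size=(16, 8))

plain = gd_layer(tokens, w_in, w_out)
accel, velocity = nesterov_layer(tokens, np.zeros_like(tokens), w_in, w_out)
```

In this sketch both layers call the same two oracles, and the accelerated variant only adds a velocity state carried from layer to layer. With `momentum=0.0` and a zero initial velocity, `nesterov_layer` reduces exactly to `gd_layer`.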
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag (val) | Accuracy | 31.58 | 54 |
| Language Modeling | TinyStories (val) | Last Loss | 2.4041 | 21 |
| Question Answering | ARC-Easy (val) | Accuracy | 43.06 | 14 |
| Language Modeling | TinyStories 10k (val) | Validation Loss (nats/token) | 1.1317 | 7 |
| Language Modeling | OpenWebText 30k (val) | Loss (nats/token) | 2.9413 | 6 |