YuriiFormer: A Suite of Nesterov-Accelerated Transformers

About

We propose a variational framework that interprets transformer layers as iterations of an optimization algorithm acting on token embeddings. In this view, self-attention implements a gradient step of an interaction energy, while MLP layers correspond to gradient updates of a potential energy. Standard GPT-style transformers emerge as vanilla gradient descent on the resulting composite objective, implemented via Lie-Trotter splitting between these two energy functionals. This perspective enables principled architectural design using classical optimization ideas. As a proof of concept, we introduce a Nesterov-style accelerated transformer that preserves the same attention and MLP oracles. The resulting architecture consistently outperforms a nanoGPT baseline on TinyStories and OpenWebText, demonstrating that optimization-theoretic insights can translate into practical gains.
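
The optimization view described above can be illustrated with a toy sketch. It is not the authors' architecture: the two quadratic energies, the scalar "tokens", the step size `eta`, the momentum `beta`, and all function names are assumptions chosen so the example stays small and runnable. It only shows the pattern the abstract names, namely a Lie-Trotter split step (interaction-gradient step, then potential-gradient step) and a Nesterov-style variant that reuses the same two gradient oracles at an extrapolated point.

```python
# Toy illustration only: quadratic stand-ins for the attention and MLP energies.
# Nothing here is taken from the paper's implementation.

def interaction_grad(x):
    """Gradient of E(x) = 0.5 * sum_i (x_i - mean(x))^2 (hypothetical attention stand-in)."""
    m = sum(x) / len(x)
    return [xi - m for xi in x]

def potential_grad(x, a=0.05, c=0.0):
    """Gradient of V(x) = 0.5 * a * sum_i (x_i - c)^2 (hypothetical MLP stand-in)."""
    return [a * (xi - c) for xi in x]

def objective(x, a=0.05, c=0.0):
    """Composite objective E(x) + V(x)."""
    m = sum(x) / len(x)
    interaction = 0.5 * sum((xi - m) ** 2 for xi in x)
    potential = 0.5 * a * sum((xi - c) ** 2 for xi in x)
    return interaction + potential

def split_step(x, eta):
    """One Lie-Trotter step: interaction-gradient step, then potential-gradient step."""
    x = [xi - eta * g for xi, g in zip(x, interaction_grad(x))]
    return [xi - eta * g for xi, g in zip(x, potential_grad(x))]

def run_gd(x0, eta=0.5, steps=50):
    """Plain gradient descent with splitting (one split step per 'layer')."""
    x = list(x0)
    for _ in range(steps):
        x = split_step(x, eta)
    return x

def run_nesterov(x0, eta=0.5, beta=0.8, steps=50):
    """Nesterov-style variant: extrapolate with momentum, then apply the same split step."""
    x_prev, x = list(x0), list(x0)
    for _ in range(steps):
        y = [xi + beta * (xi - xp) for xi, xp in zip(x, x_prev)]  # look-ahead point
        x_prev, x = x, split_step(y, eta)
    return x

x0 = [2.0, -1.0, 0.5, 3.0]  # four scalar "token embeddings"
f_start = objective(x0)
f_gd = objective(run_gd(x0))
f_nesterov = objective(run_nesterov(x0))
print(f_start, f_gd, f_nesterov)
```

On this deliberately ill-conditioned toy problem, the momentum variant reaches a lower objective than the plain split iteration after the same number of steps. That is a property of the toy quadratic and says nothing by itself about the reported language-modeling results.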

Aleksandr Zimin, Yury Polyanskiy, Philippe Rigollet • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag (val) | Accuracy | 31.58 | 54 |
| Language Modeling | TinyStories (val) | Last Loss | 2.4041 | 21 |
| Question Answering | ARC-Easy (val) | Accuracy | 43.06 | 14 |
| Language Modeling | TinyStories 10k (val) | Validation Loss (nats/token) | 1.1317 | 7 |
| Language Modeling | OpenWebText 30k (val) | Loss (nats/token) | 2.9413 | 6 |
