YuriiFormer: A Suite of Nesterov-Accelerated Transformers

About

We propose a variational framework that interprets transformer layers as iterations of an optimization algorithm acting on token embeddings. In this view, self-attention implements a gradient step of an interaction energy, while MLP layers correspond to gradient updates of a potential energy. Standard GPT-style transformers emerge as vanilla gradient descent on the resulting composite objective, implemented via Lie-Trotter splitting between these two energy functionals. This perspective enables principled architectural design using classical optimization ideas. As a proof of concept, we introduce a Nesterov-style accelerated transformer that preserves the same attention and MLP oracles. The resulting architecture consistently outperforms a nanoGPT baseline on TinyStories and OpenWebText, demonstrating that optimization-theoretic insights can translate into practical gains.

Aleksandr Zimin, Yury Polyanskiy, Philippe Rigollet • 2026
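
The layer-as-optimizer-step view described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' architecture: the toy oracles (`attention_oracle`, `make_mlp_oracle`), the fixed momentum coefficient `mu`, the placement of the look-ahead point, and the absence of normalization layers and learned attention weights are all assumptions made for illustration. The sketch shows only the general pattern: a standard residual block applies the attention update and then the MLP update (a Lie-Trotter-style splitting), while a Nesterov-style block evaluates the same two oracles at a look-ahead point and carries a velocity state between substeps.

```python
import numpy as np


def attention_oracle(x):
    """Toy single-head self-attention update with no learned weights (assumption)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Direction that moves each token toward its attention-weighted average.
    return weights @ x - x


def make_mlp_oracle(dim, hidden, seed=0):
    """Build a toy two-layer MLP update with fixed random weights (assumption)."""
    rng = np.random.default_rng(seed)
    w_in = rng.normal(scale=dim ** -0.5, size=(dim, hidden))
    w_out = rng.normal(scale=hidden ** -0.5, size=(hidden, dim))

    def mlp_oracle(x):
        return np.tanh(x @ w_in) @ w_out

    return mlp_oracle


def gd_layer(x, attn, mlp):
    """Plain residual block: attention substep, then MLP substep."""
    x = x + attn(x)
    x = x + mlp(x)
    return x


def nesterov_layer(x, v, attn, mlp, mu=0.9):
    """Hypothetical Nesterov-style block using the same two oracles.

    Each substep queries its oracle at the look-ahead point x + mu * v and
    accumulates the result into the velocity v.
    """
    lookahead = x + mu * v
    v = mu * v + attn(lookahead)
    x = x + v

    lookahead = x + mu * v
    v = mu * v + mlp(lookahead)
    x = x + v
    return x, v


rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))  # 5 tokens, embedding dimension 8
mlp = make_mlp_oracle(dim=8, hidden=16)

x_gd = gd_layer(tokens, attention_oracle, mlp)
x_acc, v_acc = nesterov_layer(
    tokens, np.zeros_like(tokens), attention_oracle, mlp, mu=0.9
)
```

In this sketch, setting `mu=0.0` makes `nesterov_layer` reduce to `gd_layer`, which mirrors the abstract's claim that the accelerated variant reuses the same attention and MLP oracles and changes only how their outputs are combined.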

Related benchmarks

Task	Dataset	Metric	Result
Commonsense Reasoning	HellaSwag (val)	Accuracy	31.58	68
Language Modeling	TinyStories (val)	Last Loss	2.4041	21
Question Answering	ARC-Easy (val)	Accuracy	43.06	19
Language Modeling	TinyStories 10k (val)	Validation Loss (nats/token)	1.1317	7
Language Modeling	OpenWebText 30k (val)	Loss (nats/token)	2.9413	6
