Turbo-Muon: Almost-Orthogonal Pre-Conditioning for Fast Muon Updates

About

Orthogonality-based optimizers, such as Muon, have recently shown strong performance across large-scale training and community-driven efficiency challenges. However, these methods rely on a costly gradient orthogonalization step. Even efficient iterative approximations such as Newton-Schulz remain expensive, typically requiring dozens of matrix multiplications to converge. We introduce a pre-conditioning procedure that improves the initialization of the Newton-Schulz iterations while incurring negligible overhead. Because the pre-conditioning reduces the initial polar error, it enables the removal of one Newton-Schulz iteration (out of the five iterations usually used in practice). The resulting implementation significantly reduces Muon's overhead. At the end-to-end training level, we observe consistent runtime improvements across speed-run and standard benchmarks, including reductions of roughly 3% in training time on multiple fast training benchmarks, while matching reference performance on both language and vision tasks. Crucially, these improvements require no hyperparameter tuning and can be adopted as a simple drop-in replacement. Beyond empirical gains, we provide theoretical insight into the geometry of the update and its potential robustness against feature collapse. Our code is publicly available on GitHub, in Optax, and as Hugging Face kernels.
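
To make the role of the starting point concrete, here is a minimal, dependency-free sketch of a Newton-Schulz orthogonalization loop. It is an illustration under stated assumptions, not the authors' implementation: it uses the classical cubic iteration X <- 1.5 X - 0.5 X X^T X rather than Muon's tuned polynomial, it initializes with plain Frobenius normalization, and it does not reproduce the paper's pre-conditioning, which the abstract does not specify. The helper names (matmul, newton_schulz, orthogonality_error) are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_normalize(g, eps=1e-7):
    """Scale g to Frobenius norm just below 1, so every singular value is below 1."""
    norm = math.sqrt(sum(v * v for row in g for v in row))
    return [[v / (norm + eps) for v in row] for row in g]


def newton_schulz(g, steps=5):
    """Approximate the orthogonal polar factor of g.

    Assumption: classical cubic iteration, not Muon's tuned coefficients.
    Each step costs two matrix multiplications.
    """
    x = frobenius_normalize(g)
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * u - 0.5 * v for u, v in zip(row_x, row_c)]
             for row_x, row_c in zip(x, xxt_x)]
    return x


def orthogonality_error(x):
    """Frobenius norm of X X^T - I; zero when the rows of X are orthonormal."""
    gram = matmul(x, transpose(x))
    n = len(gram)
    return math.sqrt(sum((gram[i][j] - (1.0 if i == j else 0.0)) ** 2
                         for i in range(n) for j in range(n)))


random.seed(0)
grad = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(4)]  # stand-in gradient
for steps in (0, 4, 5, 20):
    print(steps, orthogonality_error(newton_schulz(grad, steps)))
```

In this sketch every singular value of the normalized matrix starts below 1 and moves toward 1 with each step, so the printed error shrinks as the step count grows. A starting point whose singular values are already closer to 1 needs fewer steps to reach a given error, which is the kind of saving the abstract attributes to a better initialization.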

Thibaut Boissin, Thomas Massena, Franck Mamalet, Mathieu Serrurier • 2025

Related benchmarks

Task	Dataset	Result
Question Answering	ARC Easy (test)	Accuracy 64.77	74
Language Modeling Evaluation	General Language Evaluation Suite AE, AC, SciQ, MMLU, MMLU-P, HS, OBQA, PIQA, RACE, WG, CSQA, AGI (test)	AE Score 67.17	27
Language Modeling	GPT-Base (val)	Validation Perplexity 21.93	12
Language Modeling	GPT Small (val)	Validation Perplexity 29.66	12
Language Modeling	GPT Pre-training (val)	Validation Perplexity 21.91	8
