
MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training

About

Muon has recently emerged as a strong optimizer for large language model pre-training, orthogonalizing the momentum matrix via Newton-Schulz polar iterations. A natural intuition is that polar iterations, by flattening the singular spectrum to all ones, should also eliminate column- and row-wise norm imbalance in the update. We show that this is not true in practice: practical polar steps can substantially amplify the imbalance. We term this the post-polar imbalanced update problem, and prove that such imbalance tightens the second-order term in a blockwise descent analysis, weakening Muon's per-step descent guarantee. Motivated by this analysis, we propose Muon+, a one-line fix that inserts a single normalization step after polar orthogonalization. Muon+ adds no optimizer state. Across pre-training experiments on GPT and LLaMA models from 60M to 7B parameters, spanning both compute-optimal budgets and extended token-to-parameter ratios up to approximately 200, Muon+ consistently outperforms Muon in terms of training and validation perplexity, leading to significant overall pre-training speedup.

Ruijie Zhang, Yequan Zhao, Ziyue Liu, Zhengyang Wang, Yupeng Su, Liyan Tan, Zheng Zhang • 2026
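
The idea described in the abstract, one extra normalization applied to the output of the Newton-Schulz polar iteration, can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say which normalization Muon+ uses, so the row-wise L2 rescaling below is an assumption, and the helper names `normalize_rows` and `imbalance` are hypothetical. The quintic coefficients are the ones used in the publicly available Muon reference code. Plain Python lists are used only to keep the sketch dependency-free.

```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximate the polar factor of G with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the Muon reference code
    fro = math.sqrt(sum(x * x for row in G for x in row)) + eps
    X = [[x / fro for x in row] for row in G]  # scale so singular values are <= 1
    tall = len(X) > len(X[0])
    if tall:  # iterate on the wide orientation so X X^T is the smaller Gram matrix
        X = transpose(X)
    for _ in range(steps):
        A = matmul(X, transpose(X))
        AA = matmul(A, A)
        n = len(A)
        B = [[b * A[i][j] + c * AA[i][j] for j in range(n)] for i in range(n)]
        BX = matmul(B, X)
        X = [[a * x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, BX)]
    return transpose(X) if tall else X

def row_norms(M):
    return [math.sqrt(sum(x * x for x in row)) for row in M]

def imbalance(M):
    """Hypothetical diagnostic: ratio of the largest to the smallest row norm."""
    norms = row_norms(M)
    return max(norms) / min(norms)

def normalize_rows(M, eps=1e-7):
    """Assumed post-polar step: rescale every row to unit L2 norm."""
    return [[x / (n + eps) for x in row] for row, n in zip(M, row_norms(M))]

if __name__ == "__main__":
    momentum = [[3.0, 0.0], [0.0, 1.0]]   # toy momentum matrix
    update = newton_schulz(momentum)       # Muon-style orthogonalized update
    print(imbalance(update))               # noticeably above 1 for this input
    print(imbalance(normalize_rows(update)))  # approximately 1 after normalization
```

On this toy input, a finite number of polar steps leaves the two rows with unequal norms, and the extra rescaling removes that gap without storing any additional optimizer state. A column-wise variant would apply the same rescaling to the transposed matrix.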

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Language Modeling | FineWeb (val) | -- | 217 |
| Language Modeling | GPT Pre-training (val) | Validation Perplexity: 19.98 | 8 |
| Pre-training efficiency | Pre-training | -- | 4 |
| Commonsense Reasoning | Commonsense Reasoning Suite (OBQA, HellaSwag, ARC-E, WSC, Winogrande, BoolQ, PIQA) | Average Accuracy: 49.4 | 2 |
