MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training

About

The Muon optimizer has recently emerged as a promising alternative to AdamW for large language model training, leveraging matrix orthogonalization to produce geometry-aware updates. However, like all first-order methods, Muon can become trapped in sharp local minima. In this work, we present MONA, an optimizer that bridges Muon's orthogonalization framework with curvature-aware acceleration. MONA adds an acceleration term, computed from the exponential moving average of gradient differences, directly into Muon's gradient processing pipeline. We provide a detailed convergence analysis for MONA, showing that the acceleration term enables escape from sharp minima while preserving Muon's spectral-norm regularization. Empirically, MONA converges better and achieves stronger downstream task performance than both Muon and AdamW across three scales of Mixture-of-Experts pretraining, spanning 1B to 68B parameters, with the largest model trained on 1 trillion tokens. Furthermore, we conduct supervised fine-tuning on the MOE-68B-A3B model and evaluate it on general capability, mathematical reasoning, and code generation benchmarks, where MONA achieves state-of-the-art performance.

Jiacheng Li, Jianchao Tan, Hongtao Xu, Jiaqi Zhang, Yifan Lu, Yerui Sun, Yuchen Xie, Xunliang Cai • 2026
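
The abstract describes the update only in words, so the following is a minimal NumPy sketch of one plausible reading, not the authors' algorithm. Everything specific in it is an assumption: the function names, the hyperparameters (`lr`, `beta`, `beta_diff`, `alpha`), the choice to add the gradient-difference average to the momentum buffer before orthogonalization, and the use of the classic cubic Newton-Schulz iteration in place of whatever polynomial Muon or MONA actually uses.

```python
import numpy as np


def newton_schulz_orthogonalize(matrix, steps=30):
    """Approximate the orthogonal polar factor U V^T of matrix = U S V^T."""
    # Dividing by the Frobenius norm makes every singular value at most 1,
    # which keeps the cubic iteration below inside its convergence region.
    x = matrix / (np.linalg.norm(matrix) + 1e-12)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # work on the wide orientation so x @ x.T is the smaller Gram matrix
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x  # pushes each nonzero singular value toward 1
    return x.T if transposed else x


def mona_like_step(weight, grad, state, lr=0.02, beta=0.95, beta_diff=0.9, alpha=1.0):
    """One hypothetical step: momentum plus an EMA of gradient differences, then orthogonalize."""
    prev_grad = state.get("prev_grad", grad)  # first call: the difference is zero
    diff_ema = state.get("diff_ema", np.zeros_like(grad))
    momentum = state.get("momentum", np.zeros_like(grad))

    # Exponential moving average of successive gradient differences (assumed form).
    state["diff_ema"] = beta_diff * diff_ema + (1.0 - beta_diff) * (grad - prev_grad)
    # Heavy-ball style momentum buffer, as commonly used with Muon (assumed form).
    state["momentum"] = beta * momentum + grad
    state["prev_grad"] = grad.copy()

    # Assumed placement: the acceleration term enters before orthogonalization.
    direction = state["momentum"] + alpha * state["diff_ema"]
    return weight - lr * newton_schulz_orthogonalize(direction)
```

Under these assumptions, the first call reduces to a plain orthogonalized-momentum step because no previous gradient exists yet, and every update has singular values close to `lr`, which is one way to read the spectral-norm control the abstract mentions.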

Related benchmarks

Task	Dataset	Result
Logical reasoning	BBH	--	249
Code Generation	LiveCodeBench	--	84
Chinese Multitask Language Understanding	CMMLU	--	69
Mathematical Reasoning	MATH	--	68
Code Generation	FullStackBench	Pass@1 29.75	48
Code Reasoning	CRUXEval	Accuracy 35	36
Code Generation	HumanEval+	--	34
Multi-task Language Understanding	MMLU	MMLU Score 63.73	33
Reading Comprehension	DROP	DROP Score 48.68	25
Multi-task Language Understanding	MMLU-Pro	MMLU-Pro Score 33.75	22

Showing 10 of 22 rows
