MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training
About
The Muon optimizer has recently offered a promising alternative to AdamW for large language model training, leveraging matrix orthogonalization to produce geometry-aware updates. However, like all first-order methods, Muon can become trapped in sharp local minima. In this work, we present MONA, an optimizer that bridges Muon's orthogonalization framework with curvature-aware acceleration. MONA adds an acceleration term, computed from the exponential moving average of gradient differences, directly into Muon's gradient processing pipeline. We provide a detailed convergence analysis for MONA, showing that the acceleration term enables escape from sharp minima while preserving Muon's spectral-norm regularization. Empirically, MONA achieves better convergence and downstream task performance than both Muon and AdamW across three scales of Mixture-of-Experts pretraining, spanning 1B to 68B parameters, with the largest model trained on 1 trillion tokens. Furthermore, we conduct supervised fine-tuning on the MOE-68B-A3B model and evaluate it on general capability, mathematical reasoning, and code generation benchmarks, where MONA achieves state-of-the-art performance.
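The abstract does not give the update rule, so the following is only a minimal NumPy sketch of how an acceleration term built from an exponential moving average of gradient differences might be folded into a Muon-style step. The function `mona_like_step`, the hyperparameters `beta2` and `alpha`, and the choice to add the term before orthogonalization are all assumptions made for illustration, not the authors' algorithm. The Newton-Schulz coefficients are those used in the publicly available Muon reference code.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G with the quintic Newton-Schulz
    iteration used by the public Muon reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    # Dividing by the Frobenius norm keeps every singular value at or below 1.
    X = G / (np.linalg.norm(G) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation so X @ X.T is the small Gram matrix
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

def mona_like_step(W, grad, state, lr=0.02, beta1=0.95, beta2=0.9, alpha=0.5):
    """One hypothetical MONA-like update (assumed form, see the note above)."""
    prev_grad = state.get("prev_grad", grad)  # first step: gradient difference is zero
    # Muon-style momentum buffer.
    state["momentum"] = beta1 * state.get("momentum", np.zeros_like(W)) + grad
    # Assumed acceleration term: EMA of successive gradient differences.
    state["diff_ema"] = (beta2 * state.get("diff_ema", np.zeros_like(W))
                         + (1.0 - beta2) * (grad - prev_grad))
    state["prev_grad"] = grad.copy()
    # Nesterov-style lookahead plus the assumed acceleration term.
    direction = grad + beta1 * state["momentum"] + alpha * state["diff_ema"]
    return W - lr * newton_schulz_orthogonalize(direction)

# Toy usage: a few steps on the quadratic loss 0.5 * ||W - T||^2.
rng = np.random.default_rng(0)
T = rng.standard_normal((4, 6))
W = np.zeros((4, 6))
state = {}
for _ in range(10):
    W = mona_like_step(W, W - T, state)  # the gradient of the toy loss is W - T
```

Because the direction is orthogonalized before it is applied, the scale of the acceleration term matters only relative to the momentum term; `alpha` in this sketch controls that ratio.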
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Logical Reasoning | BBH | -- | 249 |
| Code Generation | LiveCodeBench | -- | 84 |
| Chinese Multitask Language Understanding | CMMLU | -- | 67 |
| Code Generation | FullStackBench | Pass@1: 29.75 | 48 |
| Mathematical Reasoning | MATH | -- | 46 |
| Code Reasoning | CRUXEval | Accuracy: 35 | 36 |
| Code Generation | HumanEval+ | -- | 34 |
| Reading Comprehension | DROP | DROP Score: 48.68 | 25 |
| Multi-task Language Understanding | MMLU | MMLU Score: 63.73 | 21 |
| Science Question Answering | GPQA | Score: 0.2555 | 16 |