# Muon is Scalable for LLM Training

## About
Recently, the Muon optimizer, based on matrix orthogonalization, has demonstrated strong results in training small-scale language models, but its scalability to larger models had not been proven. We identify two crucial techniques for scaling up Muon: (1) adding weight decay and (2) carefully adjusting the per-parameter update scale. These techniques allow Muon to work out of the box in large-scale training without hyperparameter tuning. Scaling-law experiments indicate that Muon achieves $\sim\!2\times$ the computational efficiency of AdamW under compute-optimal training. Building on these improvements, we introduce Moonlight, a 3B/16B-parameter Mixture-of-Experts (MoE) model trained on 5.7T tokens using Muon. Our model improves the current Pareto frontier, achieving better performance with far fewer training FLOPs than prior models. We open-source our distributed Muon implementation, which is memory-optimal and communication-efficient, and we release the pretrained, instruction-tuned, and intermediate checkpoints to support future research.
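The two techniques above can be sketched in a single update step. The sketch below is a hedged reconstruction, not the authors' released implementation: the quintic Newton-Schulz coefficients follow the open-source Muon reference code, and the `0.2 * sqrt(max(fan_out, fan_in))` scale factor is the per-parameter adjustment reported for matching AdamW's typical update RMS; the function names `newton_schulz` and `muon_step` are our own.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2-D gradient matrix G with a quintic
    Newton-Schulz iteration (coefficients from the Muon reference code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)          # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                    # work with the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param: torch.Tensor, grad: torch.Tensor, momentum_buf: torch.Tensor,
              lr: float = 1e-3, beta: float = 0.95, wd: float = 0.1) -> torch.Tensor:
    """One Muon update with the two scaling techniques from the abstract:
    (1) decoupled weight decay, (2) per-parameter update-scale adjustment."""
    momentum_buf.mul_(beta).add_(grad)            # momentum accumulation
    update = newton_schulz(momentum_buf)          # orthogonalized direction
    # (2) Match the update RMS to AdamW's typical scale so AdamW
    # hyperparameters transfer without retuning (assumed form).
    scale = 0.2 * max(param.shape[0], param.shape[1]) ** 0.5
    param.mul_(1 - lr * wd)                       # (1) decoupled weight decay
    param.add_(update, alpha=-lr * scale)         # apply rescaled update
    return param
```

Without step (2), layers of different shapes receive updates of very different magnitudes, which is one reason plain Muon needed per-model tuning at scale.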
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Question Answering | ARC Challenge | Accuracy | 58.28 | 749 |
| Question Answering | OpenBookQA | Accuracy | 45.6 | 465 |
| Question Answering | ARC Easy | Accuracy | 82.49 | 386 |
| Natural Language Inference | RTE | Accuracy | 65.7 | 367 |
| Boolean Question Answering | BoolQ | Accuracy | 80.4 | 307 |
| Question Answering | BoolQ | Accuracy | 80.4 | 240 |
| Commonsense Reasoning | WinoGrande | Accuracy | 71.11 | 231 |
| Multitask Language Understanding | MMLU | Accuracy | 67.3 | 206 |
| Commonsense Reasoning | COPA | Accuracy | 92.0 | 138 |
| Question Answering | OpenBookQA | Accuracy | 45.6 | 84 |