
Effective Quantization of Muon Optimizer States

About

The Muon optimizer, based on matrix orthogonalization, has recently shown faster convergence and better computational efficiency than AdamW in LLM pre-training. However, the memory overhead of maintaining high-precision optimizer states remains a challenge for large-scale deployment. In this paper, we introduce the 8-bit Muon optimizer using blockwise quantization. In extensive Chinchilla-optimal experiments on pre-training models of up to 2.7B parameters and fine-tuning them for instruction following, we demonstrate that 8-bit Muon achieves parity with Muon in terms of validation loss and downstream benchmarks, while achieving up to a 62% reduction in optimizer state footprint. Crucially, we show that Muon's update mechanism is uniquely compatible with a simple linear quantization scheme, bypassing the complex dynamic scaling required for quantized AdamW. We supplement our empirical findings with a theoretical analysis of Muon's robustness to quantization noise.
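To make the blockwise quantization idea concrete, the following is a minimal sketch of 8-bit blockwise linear quantization of an optimizer-state tensor in PyTorch. The block size, function names, and the symmetric absmax scaling are illustrative assumptions, not the authors' implementation.

```python
import torch

def blockwise_quantize(state: torch.Tensor, block_size: int = 256):
    """Quantize an optimizer-state tensor to int8 with one scale per block.

    Simple linear (symmetric absmax) scheme: each block is divided by its
    absolute maximum and mapped onto the signed 8-bit range [-127, 127].
    """
    flat = state.flatten().float()
    pad = (-flat.numel()) % block_size          # pad so the tensor splits into full blocks
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    blocks = flat.view(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)  # per-block absmax
    q = torch.round(blocks / scales * 127).to(torch.int8)
    return q, scales, state.shape, pad

def blockwise_dequantize(q: torch.Tensor, scales: torch.Tensor, shape, pad: int):
    """Reconstruct an approximate fp32 tensor from the int8 blocks and per-block scales."""
    flat = (q.float() / 127 * scales).flatten()
    if pad:
        flat = flat[:-pad]
    return flat.view(shape)
```

In a Muon-style setup, the momentum state would be kept in this quantized form between steps and dequantized before the orthogonalized update is computed; the paper's claim is that Muon's update tolerates the resulting quantization noise with a plain linear scheme, without the dynamic scaling machinery that quantized AdamW requires.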

Aman Gupta, Rafael Celente, Abhishek Shivanna, D.T. Braithwaite, Gregory Dexter, Shao Tang, Hiroto Udagawa, Daniel Silva, Rohan Ramanath, S. Sathiya Keerthi • 2025

Related benchmarks

Task: Instruction Following
  Dataset: AlpacaEval 2.0 (test)
  Result: LC Win Rate (%) = 59.93
  Rank: 71

Task: Zero-shot NLP Evaluation
  Dataset: NLP Downstream Benchmarks (ARC, BoolQ, HellaSwag, LAMBADA, MMLU)
  Result: ARC-C = 0.352
  Rank: 15
