Effective Quantization of Muon Optimizer States
About
The Muon optimizer, based on matrix orthogonalization, has recently shown faster convergence and better computational efficiency than AdamW in LLM pre-training. However, the memory overhead of maintaining high-precision optimizer states remains a challenge for large-scale deployment. In this paper, we introduce the 8-bit Muon optimizer using blockwise quantization. In extensive Chinchilla-optimal experiments, pre-training models with up to 2.7B parameters and fine-tuning them for instruction following, we demonstrate that 8-bit Muon achieves parity with Muon in terms of validation loss and downstream benchmarks, while achieving up to a 62% reduction in optimizer state footprint. Crucially, we show that Muon's update mechanism is uniquely compatible with a simple linear quantization scheme, bypassing the complex dynamic scaling required for quantized AdamW. We supplement our empirical findings with a theoretical analysis of Muon's robustness to quantization noise.
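The sketch below illustrates the general idea of blockwise 8-bit linear quantization applied to an optimizer state tensor (e.g., Muon's momentum buffer): each block is scaled by its absolute maximum and mapped to int8, and dequantized back on the fly. The function names, block size, and int8 layout are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of blockwise 8-bit linear quantization for an optimizer
# state tensor. Names and the block size are assumptions for illustration.
import torch

BLOCK_SIZE = 2048  # assumed block size; the paper's choice may differ


def quantize_blockwise(state: torch.Tensor, block_size: int = BLOCK_SIZE):
    """Quantize a float tensor to int8 with one absmax scale per block."""
    flat = state.detach().float().flatten()
    pad = (-flat.numel()) % block_size
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    blocks = flat.view(-1, block_size)
    # Simple linear (symmetric) scheme: scale each block by its absolute
    # maximum, then map to the signed 8-bit range [-127, 127].
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    q = torch.round(blocks / scales * 127).to(torch.int8)
    return q, scales, state.shape, pad


def dequantize_blockwise(q, scales, shape, pad):
    """Recover an approximate float tensor from int8 blocks and scales."""
    flat = ((q.float() / 127) * scales).flatten()
    if pad:
        flat = flat[:-pad]
    return flat.view(shape)


if __name__ == "__main__":
    momentum = torch.randn(4096, 4096)  # stand-in for a Muon momentum state
    q, scales, shape, pad = quantize_blockwise(momentum)
    recon = dequantize_blockwise(q, scales, shape, pad)
    print(f"max abs quantization error: {(momentum - recon).abs().max():.5f}")
```

Storing the int8 blocks plus one float scale per block, rather than a full-precision buffer, is what yields the reported reduction in optimizer state footprint.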
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Instruction Following | AlpacaEval 2.0 (test) | LC Win Rate (%) | 59.93 | 71 |
| Zero-shot NLP Evaluation | NLP Downstream Benchmarks (ARC, BoolQ, HellaSwag, LAMBADA, MMLU) | ARC-C | 0.352 | 15 |