AdaMuon: Adaptive Muon Optimizer

About

We propose AdaMuon, a novel optimizer that combines element-wise adaptivity with orthogonal updates for large-scale neural network training. AdaMuon incorporates two tightly coupled mechanisms: (1) an element-wise second-moment estimator applied to orthogonalized update directions, and (2) a sign-stabilized orthogonal update, in which the momentum is sign-transformed before orthogonalization. Together, these two components enable variance-adaptive scaling while maintaining stable update geometry. In addition, AdaMuon employs an RMS-aligned rescaling strategy that matches the root-mean-square update magnitude to that of Adam, allowing existing learning rate schedules to be reused directly without extra tuning. Experiments demonstrate that AdaMuon not only maintains stability but can also surpass Adam by more than 40% in training efficiency in large-scale scenarios.
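The three mechanisms described above can be illustrated with a minimal NumPy sketch of one update step for a single weight matrix. This is an illustration under stated assumptions, not the authors' implementation: the function names `orthogonalize` and `adamuon_step`, the hyperparameter defaults, the target RMS of 0.2, the absence of bias correction, and the use of a plain cubic Newton-Schulz iteration for orthogonalization are all choices made here for brevity and are not taken from the abstract.

```python
import numpy as np


def orthogonalize(x, steps=10):
    """Approximate the orthogonal polar factor of x (cubic Newton-Schulz).

    Assumption: a simple cubic iteration stands in for whatever
    orthogonalization routine the paper actually uses.
    """
    # Dividing by the Frobenius norm bounds every singular value by 1,
    # which keeps the iteration inside its convergence region.
    x = x / (np.linalg.norm(x) + 1e-8)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x


def adamuon_step(w, grad, state, lr=1e-3, beta1=0.95, beta2=0.95,
                 eps=1e-8, target_rms=0.2):
    """One hypothetical AdaMuon-style step; all defaults are assumptions."""
    # Momentum accumulation on the raw gradient.
    m = beta1 * state["m"] + grad
    # (2) Sign-stabilized orthogonal update: sign first, then orthogonalize.
    o = orthogonalize(np.sign(m))
    # (1) Element-wise second-moment estimate of the orthogonalized direction
    # (no bias correction in this sketch).
    v = beta2 * state["v"] + (1.0 - beta2) * o * o
    o_hat = o / (np.sqrt(v) + eps)
    # RMS-aligned rescaling: force the update's RMS to target_rms.
    scale = target_rms * np.sqrt(o_hat.size) / (np.linalg.norm(o_hat) + eps)
    update = scale * o_hat
    state["m"], state["v"] = m, v
    return w - lr * update, update


# Usage on a random 4x6 weight matrix with a random gradient.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 6))
grad = rng.standard_normal((4, 6))
state = {"m": np.zeros_like(w), "v": np.zeros_like(w)}
new_w, update = adamuon_step(w, grad, state)
update_rms = float(np.sqrt(np.mean(update ** 2)))
```

Because the rescaling divides by the Frobenius norm of the adapted direction, the RMS of `update` equals `target_rms` regardless of the gradient scale, which is the property that would let an Adam learning rate schedule carry over unchanged.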

Chongjie Si, Debing Zhang, Wei Shen • 2025

Related benchmarks

Task	Dataset	Result
Language Modeling	C4	Perplexity 22.43	1565
Image Classification	CIFAR-100 (val)	--	781
Question Answering	ARC Easy (test)	Accuracy 63.17	74
Language Modeling Evaluation	General Language Evaluation Suite AE, AC, SciQ, MMLU, MMLU-P, HS, OBQA, PIQA, RACE, WG, CSQA, AGI (test)	AE Score 67.42	27
Language Modeling	GPT Small (val)	Validation Perplexity 29.3	12
Language Modeling	GPT-Base (val)	Validation Perplexity 22.42	12
Language Modeling	GPT Pre-training (val)	Validation Perplexity 22.38	8
Language Understanding	Language Understanding Benchmarks (HellaSwag, MMLU, NIAH, PIQA, ARC_c, C-Eval, TriviaQA, OBQA, WinoG, CHID, CMMLU, GSM8k) (test)	HellaSwag Score 25.13	8
