
AdaMuon: Adaptive Muon Optimizer

About

We propose AdaMuon, a novel optimizer that combines element-wise adaptivity with orthogonal updates for large-scale neural network training. AdaMuon incorporates two tightly coupled mechanisms: (1) an element-wise second-moment estimator applied to the orthogonalized update directions, and (2) a sign-stabilized orthogonal update, in which the momentum is sign-transformed before orthogonalization. Together, these two components enable variance-adaptive scaling while maintaining stable update geometry. In addition, AdaMuon employs an RMS-aligned rescaling strategy that matches the root-mean-square update magnitude to that of Adam, allowing existing learning rate schedules to be reused directly without extra tuning. Experiments demonstrate that AdaMuon not only maintains stability but can also surpass Adam by more than 40% in training efficiency in large-scale scenarios.
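The three ingredients named in the abstract can be illustrated with a minimal NumPy sketch of one update step for a single weight matrix. It is an illustration under stated assumptions, not the authors' implementation. The function names `orthogonalize` and `adamuon_step`, the momentum form, the hyperparameter defaults, and the `target_rms` constant are all hypothetical. Orthogonalization here uses an exact SVD polar factor for clarity, whereas Muon-style optimizers typically approximate it with Newton-Schulz iterations.

```python
import numpy as np

def orthogonalize(m):
    # Exact polar factor via reduced SVD (assumed stand-in for the
    # Newton-Schulz approximation commonly used in Muon-style optimizers).
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def adamuon_step(w, grad, state, lr=1e-3, beta1=0.95, beta2=0.95,
                 eps=1e-8, target_rms=0.2):
    # All defaults above are illustrative assumptions, not values from the source.
    # First-moment (momentum) accumulation; the exact form is an assumption.
    state["m"] = beta1 * state["m"] + grad
    # (2) Sign-stabilized orthogonal update: sign-transform, then orthogonalize.
    o = orthogonalize(np.sign(state["m"]))
    # (1) Element-wise second-moment estimate of the orthogonalized direction.
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * o * o
    o_hat = o / (np.sqrt(state["v"]) + eps)
    # RMS-aligned rescaling: force the update's RMS to a fixed target so that
    # its magnitude is comparable to a typical Adam update (assumed target).
    rms = np.sqrt(np.mean(o_hat ** 2))
    update = target_rms * o_hat / (rms + eps)
    return w - lr * update

# Usage on a random 4x3 weight matrix.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 3))
state = {"m": np.zeros_like(w), "v": np.zeros_like(w)}
w_new = adamuon_step(w, rng.standard_normal((4, 3)), state)
```

Because the final rescaling fixes the RMS of the update regardless of the matrix shape, the step size is governed by `lr` alone, which is the property the abstract relies on for reusing Adam learning rate schedules.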

Chongjie Si, Debing Zhang, Wei Shen • 2025

Related benchmarks

Task | Dataset | Result | Rank
Language Modeling | C4 | Perplexity 22.43 | 1565
Question Answering | ARC Easy (test) | Accuracy 63.17 | 74
Language Modeling Evaluation | General Language Evaluation Suite AE, AC, SciQ, MMLU, MMLU-P, HS, OBQA, PIQA, RACE, WG, CSQA, AGI (test) | AE Score 67.42 | 27
Language Modeling | GPT Small (val) | Validation Perplexity 29.3 | 12
Language Modeling | GPT-Base (val) | Validation Perplexity 22.42 | 12
Language Modeling | GPT Pre-training (val) | Validation Perplexity 22.38 | 8
