
Adaptive Optimization via Momentum on Variance-Normalized Gradients

About

We introduce MVN-Grad (Momentum on Variance-Normalized Gradients), an Adam-style optimizer that improves stability and performance by combining two complementary ideas: variance-based normalization and momentum applied after normalization. MVN-Grad scales each coordinate by an exponential moving average of gradient uncertainty and applies momentum to the resulting normalized gradients, eliminating the cross-time coupling between stale momentum and a stochastic normalizer present in standard Adam-type updates. We prove that this decoupling yields strictly smaller one-step conditional update variance than momentum-then-normalize variance methods under standard noise assumptions, and that MVN-Grad is robust to outliers: it has a uniformly bounded response to single gradient spikes. In low-variance regimes, we further show variance normalization avoids sign-type collapse associated with second-moment scaling and can yield accelerated convergence. Across CIFAR-100 image classification and GPT-style language modeling benchmarks, MVN-Grad matches or outperforms Adam, AdaBelief, and LaProp, delivering smoother training and improved generalization with no added overhead.
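To make the normalize-then-momentum ordering concrete, here is a minimal sketch of one update step. Only the ordering (variance-normalize the current gradient, then apply momentum to the normalized gradient) is stated in the abstract; the specific variance estimator (an uncentered EMA of squared gradients), the bias corrections, and all hyperparameter values below are assumptions for illustration, not the paper's definitive algorithm.

```python
import numpy as np

def mvn_grad_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One hypothetical MVN-Grad step: normalize first, then apply momentum.

    `state` holds m (momentum of normalized gradients), v (EMA of squared
    gradients, a variance proxy), and the step count t. The exact variance
    estimator is an assumption; the abstract specifies only the ordering.
    """
    state["t"] += 1
    t = state["t"]
    # second-moment EMA with Adam-style bias correction (assumed form)
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad**2
    v_hat = state["v"] / (1 - beta2**t)
    # normalize the *current* gradient before momentum enters, so stale
    # momentum never couples with the stochastic normalizer
    g_norm = grad / (np.sqrt(v_hat) + eps)
    # momentum on the normalized gradient
    state["m"] = beta1 * state["m"] + (1 - beta1) * g_norm
    m_hat = state["m"] / (1 - beta1**t)
    return theta - lr * m_hat

# toy usage: minimize f(x) = x^2, whose gradient is 2x
theta = np.array([5.0])
state = {"m": np.zeros(1), "v": np.zeros(1), "t": 0}
for _ in range(2000):
    theta = mvn_grad_step(theta, 2 * theta, state, lr=0.05)
```

Note that because normalization happens before momentum, a single gradient spike is clipped to roughly unit scale before it ever enters the momentum buffer, which is consistent with the bounded-response property claimed above.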

Francisco Patitucci, Aryan Mokhtari • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Image Classification | CIFAR-100 (test) | Accuracy 79.63 | 3518 |
| Language Modeling | WikiText-103 (val) | PPL 62.62 | 180 |
| Image Classification | CIFAR-100 (train) | -- | 8 |
| Language Modeling | OpenWebText GPT-2 124M (val) | -- | 8 |
| Language Modeling | WikiText-103 (train) | PPL 70.01 | 4 |
| Language Modeling | OpenWebText GPT-2 124M NanoGPT (train) | Loss 2.9552 | 4 |
