Omni-Masked Gradient Descent: Memory-Efficient Optimization via Mask Traversal with Improved Convergence
About
Memory-efficient optimization methods have recently gained increasing attention for scaling full-parameter training of large language models under the GPU-memory bottleneck. Existing approaches either lack clear convergence guarantees or only achieve the standard ${\mathcal{O}}(\epsilon^{-4})$ iteration complexity in the nonconvex setting. We propose Omni-Masked Gradient Descent (OMGD), an optimization method based on mask traversal for memory-efficient training, and provide a nonconvex convergence analysis that establishes a strictly improved iteration complexity of $\tilde{\mathcal{O}}(\epsilon^{-3})$ for finding an $\epsilon$-approximate stationary point. Empirically, OMGD is a lightweight, plug-and-play approach that integrates seamlessly into most mainstream optimizers, yielding consistent improvements over competitive baselines in both fine-tuning and pre-training tasks.
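The core idea of mask traversal can be illustrated with a minimal sketch (this is a hypothetical toy implementation written for illustration, not the authors' code): each step updates only one block of parameters selected by a mask, and the masks are traversed cyclically so that every coordinate is eventually updated. Because only the active block needs optimizer state at any time, memory usage scales with the block size rather than the full parameter count.

```python
# Toy sketch of mask-traversal gradient descent on a quadratic objective.
# All names (omgd, block, masks) are illustrative assumptions, not the
# paper's API. Only the masked block of coordinates is updated per step.

def loss(w):
    # f(w) = sum_i (w_i - i)^2, minimized at w_i = i
    return sum((wi - i) ** 2 for i, wi in enumerate(w))

def grad(w):
    return [2 * (wi - i) for i, wi in enumerate(w)]

def omgd(w, steps=200, lr=0.1, block=2):
    n = len(w)
    # Partition coordinates into contiguous blocks; each block is one mask.
    masks = [list(range(s, min(s + block, n))) for s in range(0, n, block)]
    for t in range(steps):
        idx = masks[t % len(masks)]   # traverse masks cyclically
        g = grad(w)
        for i in idx:                 # update only the masked coordinates
            w[i] -= lr * g[i]
    return w

w = omgd([0.0] * 6)
```

Since every mask is visited equally often in the traversal, each coordinate converges at the usual gradient-descent rate on this separable objective, and the final loss is driven near zero.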
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1K | Top-1 Acc | 81.64 | 1239 |
| Image Classification | CIFAR-100 | Accuracy | 92.42 | 435 |
| Natural Language Understanding | GLUE | SST-2 | 94.84 | 55 |
| Image Classification | CIFAR-10 | Accuracy | 99.18 | 5 |
| Large Language Model Pre-training | C4 | Model Weights Share | 12.55 | 5 |
| Image Classification | ImageNet | Accuracy | 65.34 | 3 |