Omni-Masked Gradient Descent: Memory-Efficient Optimization via Mask Traversal with Improved Convergence

About

Memory-efficient optimization methods have recently gained increasing attention for scaling full-parameter training of large language models under the GPU-memory bottleneck. Existing approaches either lack clear convergence guarantees or only achieve the standard ${\mathcal{O}}(\epsilon^{-4})$ iteration complexity in nonconvex settings. We propose Omni-Masked Gradient Descent (OMGD), an optimization method based on mask traversal for memory-efficient training, and provide a nonconvex convergence analysis that establishes a strictly improved iteration complexity of $\tilde{\mathcal{O}}(\epsilon^{-3})$ for finding an $\epsilon$-approximate stationary point. Empirically, OMGD is a lightweight, plug-and-play approach that integrates seamlessly into most mainstream optimizers, yielding consistent improvements over competitive baselines in both fine-tuning and pre-training tasks.
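To make the idea of mask traversal concrete, below is a minimal illustrative sketch in PyTorch, not the paper's exact algorithm: the parameters are partitioned into disjoint boolean masks, each step applies a plain gradient update only to the coordinates selected by the current mask, and the masks are traversed round-robin so every coordinate is eventually updated. The function names (`make_block_masks`, `masked_sgd_step`) and the round-robin traversal order are assumptions made for illustration.

```python
# Minimal illustrative sketch (assumed names, not the paper's exact algorithm):
# a gradient step that only updates the coordinates selected by the current mask,
# with masks traversed round-robin so every coordinate is eventually visited.
import torch

def make_block_masks(param: torch.Tensor, num_blocks: int):
    """Partition the flattened parameter indices into disjoint boolean masks."""
    idx = torch.arange(param.numel())
    return [((idx % num_blocks) == b).view(param.shape) for b in range(num_blocks)]

def masked_sgd_step(param: torch.Tensor, grad: torch.Tensor,
                    mask: torch.Tensor, lr: float = 1e-2) -> None:
    """Plain SGD update restricted to the masked coordinates."""
    with torch.no_grad():
        param -= lr * grad * mask  # unmasked coordinates are left untouched

# Toy usage on a single weight matrix with a stand-in quadratic objective.
w = torch.randn(4, 4, requires_grad=True)
masks = make_block_masks(w, num_blocks=4)
for step in range(8):
    loss = (w ** 2).sum()
    loss.backward()
    masked_sgd_step(w, w.grad, masks[step % len(masks)])
    w.grad = None  # reset accumulated gradients between steps
```

One way such masked schemes save GPU memory, presumably also the motivation here, is that per-coordinate optimizer state (e.g., Adam moment estimates) only needs to be materialized for the currently active mask rather than for all parameters at once.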

Hui Yang, Tao Ren, Jinyang Jiang, Wan Tian, Yijie Peng • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Image Classification | ImageNet-1K | Top-1 Acc | 81.64 | 1239
Image Classification | CIFAR-100 | Accuracy | 92.42 | 435
Natural Language Understanding | GLUE | SST-2 | 94.84 | 55
Image Classification | CIFAR-10 | Accuracy | 99.18 | 5
Large Language Model Pre-training | C4 | Model Weights Share | 12.55 | 5
Image Classification | ImageNet | Accuracy | 65.34 | 3
