Omni-Masked Gradient Descent: Memory-Efficient Optimization via Mask Traversal with Improved Convergence
About
Memory-efficient optimization methods have recently gained increasing attention for scaling full-parameter training of large language models under the GPU-memory bottleneck. Existing approaches either lack clear convergence guarantees or only achieve the standard ${\mathcal{O}}(\epsilon^{-4})$ iteration complexity in the nonconvex setting. We propose Omni-Masked Gradient Descent (OMGD), an optimization method based on mask traversal for memory-efficient training, and provide a nonconvex convergence analysis that establishes a strictly improved iteration complexity of $\tilde{\mathcal{O}}(\epsilon^{-3})$ for finding an $\epsilon$-approximate stationary point. Empirically, OMGD is a lightweight, plug-and-play approach that integrates seamlessly into most mainstream optimizers, yielding consistent improvements over competitive baselines in both fine-tuning and pre-training tasks.
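The core idea of mask traversal can be illustrated with a minimal sketch (this is a hypothetical toy implementation written for illustration, not the authors' code): each step updates only one block of parameters selected by a mask, and the masks are traversed cyclically so that every coordinate is eventually updated. Because only the active block needs optimizer state at any time, memory usage scales with the block size rather than the full parameter count.

```python
# Toy sketch of mask-traversal gradient descent on a quadratic objective.
# All names (omgd, block, masks) are illustrative assumptions, not the
# paper's API. Only the masked block of coordinates is updated per step.

def loss(w):
    # f(w) = sum_i (w_i - i)^2, minimized at w_i = i
    return sum((wi - i) ** 2 for i, wi in enumerate(w))

def grad(w):
    return [2 * (wi - i) for i, wi in enumerate(w)]

def omgd(w, steps=200, lr=0.1, block=2):
    n = len(w)
    # Partition coordinates into contiguous blocks; each block is one mask.
    masks = [list(range(s, min(s + block, n))) for s in range(0, n, block)]
    for t in range(steps):
        idx = masks[t % len(masks)]   # traverse masks cyclically
        g = grad(w)
        for i in idx:                 # update only the masked coordinates
            w[i] -= lr * g[i]
    return w

w = omgd([0.0] * 6)
```

Since every mask is visited equally often in the traversal, each coordinate converges at the usual gradient-descent rate on this separable objective, and the final loss is driven near zero.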
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1K | Top-1 Acc | 81.64 | 1239 |
| Image Classification | CIFAR-100 | Accuracy | 92.42 | 435 |
| Natural Language Understanding | GLUE | SST-2 | 94.84 | 55 |
| Image Classification | CIFAR-10 | Accuracy | 99.18 | 5 |
| Large Language Model Pre-training | C4 | Model Weights Share | 12.55 | 5 |
| Image Classification | ImageNet | Accuracy | 65.34 | 3 |