
Sign-SGD via Parameter-Free Optimization

About

Large language models have achieved major advances across domains, yet training them remains extremely resource-intensive. We revisit Sign-SGD, which serves both as a memory-efficient optimizer for single-node training and as a gradient compression mechanism for distributed learning. This paper addresses a central limitation: the effective stepsize cannot be determined a priori because it depends on unknown, problem-specific quantities. We present a parameter-free Sign-SGD that removes manual stepsize selection. We analyze the deterministic single-node case, and extend the method to stochastic single-node training and multi-node settings. We also incorporate momentum into our algorithms and propose a memory-efficient variant that stores only gradient signs instead of full gradients. We evaluate our methods on pre-training LLaMA models (130M and 350M) and fine-tuning a Swin Transformer (28M). Across the considered tasks, the proposed methods match the performance of tuned Sign-SGD and AdamW (grid-searched stepsizes with a cosine schedule), while avoiding tuning overhead. Employing parameter-free training yields approximately $1.5\times$ end-to-end speedup compared to runs with grid-searched stepsizes.
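For orientation, the update underlying all of the above is plain Sign-SGD: each coordinate moves by the stepsize in the direction opposite to its gradient's sign, so only one bit per coordinate of gradient information is used. The sketch below shows this basic update on a toy quadratic; it is not the authors' parameter-free variant (their stepsize rule is not given on this page), so the fixed `stepsize` here is a hypothetical placeholder for illustration.

```python
import numpy as np

def sign_sgd_step(w, grad, stepsize):
    # Sign-SGD update: w <- w - stepsize * sign(grad).
    # Only the sign of each gradient coordinate is used,
    # which is what makes the method attractive for
    # gradient compression in distributed training.
    return w - stepsize * np.sign(grad)

# Toy quadratic f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([4.0, -2.0, 1.0])
for _ in range(50):
    w = sign_sgd_step(w, w, stepsize=0.1)
```

With a fixed stepsize, each coordinate decreases linearly until it reaches a neighborhood of the optimum of radius equal to the stepsize, then oscillates there; this is exactly why the effective stepsize matters and why a parameter-free rule is useful.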

Daniil Medyakov, Sergey Stanko, Gleb Molodtsov, Philip Zmushko, Grigoriy Evseev, Egor Petrov, Aleksandr Beznosikov • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | TinyImageNet (val) | Accuracy 79.161 | 240 |
| Language Modeling | C4 LLaMA-130M (val) | Perplexity 18.504 | 27 |
| Language Modeling | LLaMA-350M pre-training (val) | Validation Loss 2.707 | 10 |
| Language Modeling Pre-training | C4 (val) | -- | 10 |
| MRI Reconstruction | fastMRI | SSIM 0.724 | 6 |
| Molecular property prediction | OGBG | mAP 24.2 | 6 |
