
Sign-SGD via Parameter-Free Optimization

About

Large language models have achieved major advances across domains, yet training them remains extremely resource-intensive. We revisit Sign-SGD, which serves both as a memory-efficient optimizer for single-node training and as a gradient compression mechanism for distributed learning. This paper addresses a central limitation: the effective stepsize cannot be determined a priori because it depends on unknown, problem-specific quantities. We present a parameter-free Sign-SGD that removes manual stepsize selection. We analyze the deterministic single-node case, and extend the method to stochastic single-node training and multi-node settings. We also incorporate momentum into our algorithms and propose a memory-efficient variant that stores only gradient signs instead of full gradients. We evaluate our methods on pre-training LLaMA models (130M and 350M) and fine-tuning a Swin Transformer (28M). Across the considered tasks, the proposed methods match the performance of tuned Sign-SGD and AdamW (grid-searched stepsizes with a cosine schedule), while avoiding tuning overhead. Employing parameter-free training yields approximately $1.5\times$ end-to-end speedup compared to runs with grid-searched stepsizes.
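For orientation, the update underlying all of the above is plain Sign-SGD: each coordinate moves by the stepsize in the direction opposite to its gradient's sign, so only one bit per coordinate of gradient information is used. The sketch below shows this basic update on a toy quadratic; it is not the authors' parameter-free variant (their stepsize rule is not given on this page), so the fixed `stepsize` here is a hypothetical placeholder for illustration.

```python
import numpy as np

def sign_sgd_step(w, grad, stepsize):
    # Sign-SGD update: w <- w - stepsize * sign(grad).
    # Only the sign of each gradient coordinate is used,
    # which is what makes the method attractive for
    # gradient compression in distributed training.
    return w - stepsize * np.sign(grad)

# Toy quadratic f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([4.0, -2.0, 1.0])
for _ in range(50):
    w = sign_sgd_step(w, w, stepsize=0.1)
```

With a fixed stepsize, each coordinate decreases linearly until it reaches a neighborhood of the optimum of radius equal to the stepsize, then oscillates there; this is exactly why the effective stepsize matters and why a parameter-free rule is useful.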

Daniil Medyakov, Sergey Stanko, Gleb Molodtsov, Philip Zmushko, Grigoriy Evseev, Egor Petrov, Aleksandr Beznosikov • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | TinyImageNet (val) | Accuracy 79.161 | 240 |
| Language Modeling | C4 LLaMA-130M (val) | Perplexity 18.504 | 27 |
| Language Modeling | LLaMA-350M pre-training (val) | Validation Loss 2.707 | 10 |
| Language Modeling Pre-training | C4 (val) | -- | 10 |
| MRI Reconstruction | fastMRI | SSIM 0.724 | 6 |
| Molecular property prediction | OGBG | mAP 24.2 | 6 |
