Memory-Efficient LLM Pretraining via Minimalist Optimizer Design

About

Training large language models (LLMs) relies on adaptive optimizers such as Adam, which introduce extra operations and require significantly more memory than SGD to maintain first- and second-order moments. While recent works such as GaLore, Fira and APOLLO have proposed state-compressed memory-efficient variants, a fundamental question remains: What are the minimum modifications to plain SGD needed to match state-of-the-art pretraining performance? We systematically investigate this question using a bottom-up approach, and identify two simple yet highly memory- and compute-efficient techniques: (1) column-wise gradient normalization (normalizing the gradient along the output dimension), which boosts SGD performance without momentum; and (2) applying first-order momentum only to the output layer, where gradient variance is highest. Combining these two techniques leads to SCALE (Stochastic Column-normAlized Last-layer momEntum), a simple optimizer for memory-efficient pretraining. Across multiple models (60M-1B), SCALE matches or exceeds the performance of Adam while using only 35-45% of the total memory. It also consistently outperforms memory-efficient optimizers such as GaLore, Fira and APOLLO, making it a strong candidate for large-scale pretraining under memory constraints. For LLaMA 7B, SCALE outperforms the state-of-the-art memory-efficient methods APOLLO and Muon in both perplexity and memory consumption.
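
The two techniques named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: gradients are stored as lists of rows with shape (input dimension × output dimension), each column is scaled to unit Euclidean norm, momentum is an exponential moving average, and normalization is applied after the momentum update. The names `column_normalize`, `scale_like_step`, `last_layer`, `lr`, `beta` and `eps` are hypothetical.

```python
import math


def column_normalize(grad, eps=1e-8):
    """Scale each column of a 2-D gradient (list of rows) to unit L2 norm.

    Assumption: columns index the output dimension.
    """
    n_rows, n_cols = len(grad), len(grad[0])
    norms = [
        math.sqrt(sum(grad[i][j] ** 2 for i in range(n_rows)))
        for j in range(n_cols)
    ]
    return [
        [grad[i][j] / (norms[j] + eps) for j in range(n_cols)]
        for i in range(n_rows)
    ]


def scale_like_step(weights, grads, momentum_buf, last_layer, lr=0.1, beta=0.9):
    """One SGD-style update over a dict of name -> matrix.

    Only `last_layer` keeps a momentum buffer (assumed EMA form); every
    other layer is stateless, which is where the memory saving comes from.
    """
    for name, g in grads.items():
        if name == last_layer:
            buf = momentum_buf.get(name)
            if buf is None:
                buf = [[0.0] * len(g[0]) for _ in g]
            buf = [
                [beta * b + (1.0 - beta) * gij for b, gij in zip(brow, grow)]
                for brow, grow in zip(buf, g)
            ]
            momentum_buf[name] = buf
            g = buf  # assumed order: momentum first, then normalization
        update = column_normalize(g)
        weights[name] = [
            [w - lr * u for w, u in zip(wrow, urow)]
            for wrow, urow in zip(weights[name], update)
        ]
    return weights, momentum_buf


# Toy usage: one hidden matrix (no optimizer state) and one output matrix.
weights = {"hidden": [[1.0, 1.0], [1.0, 1.0]], "out": [[0.0], [0.0]]}
grads = {"hidden": [[3.0, 0.0], [4.0, 2.0]], "out": [[1.0], [0.0]]}
weights, state = scale_like_step(weights, grads, {}, last_layer="out")
```

After the step, `state` holds a buffer for `"out"` only, so the optimizer state in this sketch grows with the size of the last layer rather than with the whole model.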

Athanasios Glentis, Jiaxiang Li, Andi Han, Mingyi Hong • 2025

Related benchmarks

Task	Dataset	Result
Language Modeling	C4 (train)	--	50
LLM Pretraining	C4	Perplexity 13.49	47
LLM Pretraining	C4	Perplexity 15.57	16
Language Modeling	C4	Perplexity 11.96	2
Natural Language Understanding	GLUE	RTE Accuracy 80.46	2
