GWT: Scalable Optimizer State Compression for Large Language Model Training

About

Large Language Models (LLMs) have demonstrated exceptional capabilities across diverse natural language processing benchmarks. However, the growing number of model parameters imposes prohibitive memory overhead during training, especially with stateful optimizers such as Adam, which keep additional per-parameter state. Conventional memory-efficient strategies, typically based on singular value decomposition (SVD) or weight freezing, often incur non-negligible performance degradation relative to full-rank updates. To address these limitations, this paper explores memory-efficient optimization beyond low-rank constraints and proposes the Gradient Wavelet Transform (GWT). GWT is a compression framework that projects gradients into wavelet subspaces, compacting the optimizer states while preserving essential update information. We show theoretically and empirically that GWT can be integrated into existing optimizers, enabling resource-efficient training without compromising model quality. Evaluations on both large-scale pre-training and task-specific fine-tuning show that GWT matches the performance of advanced memory-efficient optimizers and of full-rank updates. Furthermore, GWT provides a scalable and robust solution for managing the memory-intensive pipelines found in modern large-scale data engineering and knowledge discovery systems.
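To make the idea of keeping optimizer state in a wavelet subspace concrete, here is a minimal sketch in pure Python. It is not the authors' implementation. It assumes a single-level Haar wavelet, Adam moments stored only for the approximation band (half the gradient length), and detail coefficients scaled by the same Adam denominator. The abstract specifies none of these choices, and the names `haar_forward`, `haar_inverse`, and `WaveletAdamSketch` are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(g):
    """One-level Haar transform of an even-length list: (approx, detail)."""
    approx = [(g[i] + g[i + 1]) / SQRT2 for i in range(0, len(g), 2)]
    detail = [(g[i] - g[i + 1]) / SQRT2 for i in range(0, len(g), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward, interleaving the reconstructed pairs."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return out


class WaveletAdamSketch:
    """Adam-style update whose moments live only on the approximation band.

    Assumption: this layout (Haar, one level, shared denominator for the
    detail band) is illustrative and is not taken from the paper.
    """

    def __init__(self, n, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        if n % 2:
            raise ValueError("gradient length must be even for one Haar level")
        # Optimizer state is n // 2 entries per moment instead of n.
        self.m = [0.0] * (n // 2)
        self.v = [0.0] * (n // 2)
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.t = 0

    def step(self, weights, grad):
        self.t += 1
        b1, b2 = self.beta1, self.beta2
        approx, detail = haar_forward(grad)

        # Standard Adam moment updates, applied to the approximation band only.
        self.m = [b1 * m + (1 - b1) * a for m, a in zip(self.m, approx)]
        self.v = [b2 * v + (1 - b2) * a * a for v, a in zip(self.v, approx)]

        c1 = 1 - b1 ** self.t  # bias corrections
        c2 = 1 - b2 ** self.t
        denom = [math.sqrt(v / c2) + self.eps for v in self.v]

        approx_update = [m / c1 / dn for m, dn in zip(self.m, denom)]
        # Assumption: detail coefficients carry no state and reuse the denominator.
        detail_update = [d / dn for d, dn in zip(detail, denom)]

        # Map the update back to parameter space and apply it.
        update = haar_inverse(approx_update, detail_update)
        return [w - self.lr * u for w, u in zip(weights, update)]
```

In this sketch each moment buffer holds half as many entries as the gradient. Applying further transform levels to the approximation band would shrink the state again, at the cost of a coarser second-moment estimate.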

Ziqing Wen, Ping Luo, Jiahuan Wang, Kun Yuan, Dongsheng Li, Tao Sun • 2025

Related benchmarks

Task	Dataset	Result
Language Modeling	C4 (val)	PPL 13.48	908
Natural Language Understanding	GLUE (val)	SST-2 94.26	201
Multitask Language Understanding	MMLU (val)	Accuracy 74.12	94
Language Modeling	C4 LLaMA-130M (val)	Perplexity 23.84	40
Language Modeling	C4 Qwen2.5 (val)	Perplexity (PPL) 17.6	27
Language Modeling	C4 LLaMA-60M (val)	Perplexity 32.94	25
Language Modeling	C4 LLaMA-350M (val)	Perplexity 18.12	23
Language Modeling Pre-training	C4 (val)	PPL (60k) 14.8	14
Language Modeling	C4 LLaMA-1.3B (val)	Perplexity 14.99	12
