FOAM: Blocked State Folding for Memory-Efficient LLM Training

About

Large language models (LLMs) have demonstrated remarkable performance due to their large parameter counts and extensive training data. However, their scale leads to significant memory bottlenecks during training, especially when using memory-intensive optimizers like Adam. Existing memory-efficient approaches often rely on techniques such as singular value decomposition (SVD), projections, or weight freezing, which can introduce substantial computational overhead, require additional memory for projections, or degrade model performance. In this paper, we propose Folded Optimizer with Approximate Moment (FOAM), a method that compresses optimizer states by computing block-wise gradient means and incorporates a residual correction to recover lost information. Theoretically, FOAM achieves convergence rates equivalent to vanilla Adam under standard non-convex optimization settings. Empirically, FOAM eliminates up to 90% of the memory overhead of optimizer states and accelerates convergence. Furthermore, FOAM is compatible with other memory-efficient optimizers, delivering performance and throughput that match or surpass both full-rank and existing memory-efficient baselines. Code is available at https://github.com/zqOuO/FOAM.
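The abstract does not spell out the update rule, so the following is only a minimal sketch of the two ingredients it names: block-wise gradient means and a residual that captures what the means lose. The function names (`fold`, `unfold`, `residual`), the contiguous-block layout, and the flat-list gradient are all assumptions made for illustration; none of this is taken from the FOAM repository.

```python
from typing import List


def fold(grad: List[float], block_size: int) -> List[float]:
    """Compress a flat gradient into one mean per contiguous block (assumed layout)."""
    if block_size <= 0 or len(grad) % block_size != 0:
        raise ValueError("len(grad) must be a multiple of a positive block_size")
    return [
        sum(grad[i:i + block_size]) / block_size
        for i in range(0, len(grad), block_size)
    ]


def unfold(folded: List[float], block_size: int) -> List[float]:
    """Expand block means back to full length by repeating each mean."""
    return [mean for mean in folded for _ in range(block_size)]


def residual(grad: List[float], block_size: int) -> List[float]:
    """Per-entry information lost by folding: gradient minus its block mean."""
    approx = unfold(fold(grad, block_size), block_size)
    return [g - a for g, a in zip(grad, approx)]


# Hypothetical 4-entry gradient, folded with blocks of size 2.
grad = [1.0, 3.0, 2.0, 6.0]
folded = fold(grad, 2)    # two block means instead of four entries
res = residual(grad, 2)   # what the block means alone cannot represent
```

In this sketch a state kept at the folded size holds `len(grad) / block_size` entries, which is where the memory saving would come from. How FOAM actually combines the folded moments with the residual is defined in the paper and its code, not here.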

Ziqing Wen, Jiahuan Wang, Ping Luo, Dongsheng Li, Tao Sun • 2025

Related benchmarks

Task	Dataset	Metric	Result
Language Modeling	C4 LLaMA-130M (val)	Perplexity	22.51	40
Language Modeling	C4 Qwen2.5 (val)	Perplexity (PPL)	15.8	27
Language Modeling	C4 LLaMA-60M (val)	Perplexity	28.53	25
Language Modeling	C4 LLaMA-350M (val)	Perplexity	15.87	23
Natural Language Understanding	GLUE RoBERTa LARGE (test dev)	MNLI Accuracy	89.74	22
Fine-tuning	MMLU (val)	MMLU STEM Accuracy	69.12	14
Language Modeling Pre-training	C4 (val)	PPL (60k)	13.33	14
Language Modeling	C4 LLaMA-1.3B (val)	Perplexity	13.13	12
