Memory Efficient Optimizers with 4-bit States
About
Optimizer states are a major source of memory consumption when training neural networks, limiting the largest trainable model within a given memory budget. Compressing optimizer states from 32-bit floating point to lower bitwidths is a promising way to reduce the training memory footprint, but the lowest bitwidth achieved so far is 8-bit. In this work, we push the optimizer state bitwidth down to 4-bit through a detailed empirical analysis of the first and second moments. Specifically, we find that the moments have complicated outlier patterns that current block-wise quantization cannot approximate accurately. We use a smaller block size and propose to exploit both row-wise and column-wise information for better quantization. We further identify a zero-point problem in quantizing the second moment, and solve it with a linear quantizer that excludes the zero point. Our 4-bit optimizers are evaluated on a wide variety of benchmarks including natural language understanding, machine translation, image classification, and instruction tuning. On all tasks our optimizers achieve accuracy comparable to their full-precision counterparts, while enjoying better memory efficiency.
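The two ideas above — combining row-wise and column-wise information when scaling the first moment, and a linear quantizer for the non-negative second moment whose smallest code is positive rather than zero — can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (function names, level counts, and the elementwise-min scale rule are illustrative), not the paper's released implementation:

```python
import numpy as np

def quantize_dequantize_moment1(m):
    """Sketch: 4-bit quantize/dequantize a 2D first-moment tensor.

    The scale for entry (i, j) is the smaller of the row-i and column-j
    absolute maxima, so a single outlier stretches only its own row or
    column rather than every block it touches. Since |m[i, j]| never
    exceeds either maximum, no clipping is needed.
    """
    row_max = np.abs(m).max(axis=1, keepdims=True)   # shape (r, 1)
    col_max = np.abs(m).max(axis=0, keepdims=True)   # shape (1, c)
    scale = np.minimum(row_max, col_max)             # broadcast to (r, c)
    scale = np.where(scale == 0, 1.0, scale)         # guard all-zero rows/cols
    q = np.round(m / scale * 7).astype(np.int8)      # signed 4-bit codes -7..7
    return q * scale / 7                             # dequantized approximation

def quantize_dequantize_moment2(v, bits=4):
    """Sketch: linear quantizer for the second moment that excludes zero.

    Adam's second moment is non-negative; mapping codes 1..2**bits to
    strictly positive values means no entry dequantizes to exactly zero,
    avoiding a collapsed denominator in the update rule.
    """
    levels = 2 ** bits                               # 16 codes at 4-bit
    vmax = v.max()
    if vmax == 0:
        return v
    q = np.ceil(v / vmax * levels).clip(1, levels)   # smallest code is 1
    return q * vmax / levels                         # always > 0
```

In this sketch the per-entry error of the first-moment quantizer is bounded by `scale / 14`, and every dequantized second-moment entry stays strictly positive by construction.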
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 79.2 | 1460 |
| Image Classification | ImageNet-1K | -- | -- | 524 |
| Natural Language Understanding | GLUE | SST-2 | 96.4 | 452 |
| Question Answering | SQuAD 2.0 | F1 | 89 | 190 |
| Reasoning | ARC Easy | Accuracy | 64.1 | 183 |
| Language Understanding | MMLU 5-shot | Accuracy | 54.9 | 132 |
| Commonsense Reasoning | ARC Challenge | Accuracy | 48 | 132 |
| Machine Translation | WMT En-De '14 | BLEU | 26.45 | 89 |
| Question Answering | SQuAD v1.1 | F1 | 94.6 | 79 |
| Natural Language Generation | E2E (test) | ROUGE-L | 68.9 | 79 |