8-bit Optimizers via Block-wise Quantization

About

Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain stochastic gradient descent but uses memory that might otherwise be allocated to model parameters, thereby limiting the maximum size of models trained in practice. In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states. To overcome the resulting computational, quantization, and stability challenges, we develop block-wise dynamic quantization. Block-wise quantization divides input tensors into smaller blocks that are independently quantized. Each block is processed in parallel across cores, yielding faster optimization and high precision quantization. To maintain stability and performance, we combine block-wise quantization with two additional changes: (1) dynamic quantization, a form of non-linear optimization that is precise for both large and small magnitude values, and (2) a stable embedding layer to reduce gradient variance that comes from the highly non-uniform distribution of input tokens in language models. As a result, our 8-bit optimizers maintain 32-bit performance with a small fraction of the memory footprint on a range of tasks, including 1.5B parameter language modeling, GLUE finetuning, ImageNet classification, WMT'14 machine translation, MoCo v2 contrastive ImageNet pretraining+finetuning, and RoBERTa pretraining, without changes to the original optimizer hyperparameters. We open-source our 8-bit optimizers as a drop-in replacement that only requires a two-line code change.

Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer• 2021

Related benchmarks

Task	Dataset	Result
Language Modeling	WikiText-2 (test)	PPL3.42	2333
Language Modeling	C4 (val)	--	737
Image Classification	ImageNet-1K	--	600
Natural Language Understanding	GLUE	--	551
Multitask Language Understanding	MMLU (test)	Accuracy69.18	312
Language Modeling	WikiText-103	PPL20.22	216
Machine Translation	WMT En-De '14	BLEU26.66	89
Multiple-choice Question Answering	MMLU 5-shot	Accuracy66.92	73
Regression	California Housing	MSE0.189	71
Question Answering	SQuAD	F1 Score94.5	63

Showing 10 of 32 rows

Other info

Follow for update

@wizwand_team Discord