Memory Efficient Optimizers with 4-bit States
About
Optimizer states are a major source of memory consumption when training neural networks, limiting the largest trainable model within a given memory budget. Compressing optimizer states from 32-bit floating point to lower bitwidths is a promising way to reduce the training memory footprint, but the lowest bitwidth achieved so far is 8-bit. In this work, we push the optimizer state bitwidth down to 4-bit through a detailed empirical analysis of the first and second moments. Specifically, we find that the moments have complicated outlier patterns that current block-wise quantization cannot approximate accurately. We use a smaller block size and propose to exploit both row-wise and column-wise information for better quantization. We further identify a zero-point problem in quantizing the second moment, and solve it with a linear quantizer that excludes the zero point. Our 4-bit optimizers are evaluated on a wide variety of benchmarks including natural language understanding, machine translation, image classification, and instruction tuning. On all tasks our optimizers achieve accuracy comparable to their full-precision counterparts, while enjoying better memory efficiency.
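The two ideas above — combining row-wise and column-wise information when scaling the first moment, and a linear quantizer for the non-negative second moment whose smallest code is positive rather than zero — can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (function names, level counts, and the elementwise-min scale rule are illustrative), not the paper's released implementation:

```python
import numpy as np

def quantize_dequantize_moment1(m):
    """Sketch: 4-bit quantize/dequantize a 2D first-moment tensor.

    The scale for entry (i, j) is the smaller of the row-i and column-j
    absolute maxima, so a single outlier stretches only its own row or
    column rather than every block it touches. Since |m[i, j]| never
    exceeds either maximum, no clipping is needed.
    """
    row_max = np.abs(m).max(axis=1, keepdims=True)   # shape (r, 1)
    col_max = np.abs(m).max(axis=0, keepdims=True)   # shape (1, c)
    scale = np.minimum(row_max, col_max)             # broadcast to (r, c)
    scale = np.where(scale == 0, 1.0, scale)         # guard all-zero rows/cols
    q = np.round(m / scale * 7).astype(np.int8)      # signed 4-bit codes -7..7
    return q * scale / 7                             # dequantized approximation

def quantize_dequantize_moment2(v, bits=4):
    """Sketch: linear quantizer for the second moment that excludes zero.

    Adam's second moment is non-negative; mapping codes 1..2**bits to
    strictly positive values means no entry dequantizes to exactly zero,
    avoiding a collapsed denominator in the update rule.
    """
    levels = 2 ** bits                               # 16 codes at 4-bit
    vmax = v.max()
    if vmax == 0:
        return v
    q = np.ceil(v / vmax * levels).clip(1, levels)   # smallest code is 1
    return q * vmax / levels                         # always > 0
```

In this sketch the per-entry error of the first-moment quantizer is bounded by `scale / 14`, and every dequantized second-moment entry stays strictly positive by construction.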
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 79.2 | 1460 |
| Image Classification | ImageNet-1K | -- | -- | 524 |
| Natural Language Understanding | GLUE | SST-2 | 96.4 | 452 |
| Question Answering | SQuAD 2.0 | F1 | 89 | 190 |
| Reasoning | ARC Easy | Accuracy | 64.1 | 183 |
| Language Understanding | MMLU 5-shot | Accuracy | 54.9 | 132 |
| Commonsense Reasoning | ARC Challenge | Accuracy | 48 | 132 |
| Machine Translation | WMT En-De '14 | BLEU | 26.45 | 89 |
| Question Answering | SQuAD v1.1 | F1 | 94.6 | 79 |
| Natural Language Generation | E2E (test) | ROUGE-L | 68.9 | 79 |