FP8 Formats for Deep Learning
About
FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for representation of special values, E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit pattern for NaNs. We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions. Our study covers the main modern neural network architectures - CNNs, RNNs, and Transformer-based models - leaving all the hyperparameters unchanged from the 16-bit baseline training sessions. Our training experiments include large language models of up to 175B parameters. We also examine FP8 post-training quantization of language models, trained using 16-bit formats, that resisted fixed-point int8 quantization.
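The two encodings differ only in how they split the 7 non-sign bits and in how they handle special values. A minimal sketch of a decoder for both, assuming the biases implied by the formats (7 for E4M3, 15 for E5M2) and the paper's convention that E4M3 reserves only the all-ones exponent with all-ones mantissa for NaN:

```python
import math

def decode_fp8(byte: int, fmt: str = "e4m3") -> float:
    """Decode one FP8 byte to a Python float.

    fmt="e4m3": 1 sign, 4 exponent, 3 mantissa bits, bias 7. No infinities;
                only the S.1111.111 pattern is NaN, which extends the finite
                range to +/-448.
    fmt="e5m2": 1 sign, 5 exponent, 2 mantissa bits, bias 15. IEEE-754-style
                specials: all-ones exponent encodes inf (mantissa 0) or NaN.
    """
    if fmt == "e4m3":
        exp_bits, man_bits, bias = 4, 3, 7
    elif fmt == "e5m2":
        exp_bits, man_bits, bias = 5, 2, 15
    else:
        raise ValueError(f"unknown format: {fmt}")

    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_all_ones = (1 << exp_bits) - 1

    if fmt == "e5m2" and exp == exp_all_ones:
        return sign * math.inf if man == 0 else math.nan
    if fmt == "e4m3" and exp == exp_all_ones and man == (1 << man_bits) - 1:
        return math.nan  # the single NaN mantissa pattern per sign
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 1 - bias.
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)
```

For example, the largest finite E4M3 value, `0b0_1111_110`, decodes to 448, whereas the same exponent pattern in E5M2 (`0b0_11111_00`) is infinity; the largest finite E5M2 value is 57344.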
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy | 61.7 | 1362 |
| Multi-task Language Understanding | MMLU | Accuracy | 64.5 | 321 |
| Document Question Answering | Qasper | Accuracy | 40.8 | 44 |
| Variable Tracking | RULER-VT | Accuracy | 99.9 | 33 |
| Key-Value Retrieval | LITM (Lost in the Middle) | Accuracy | 99.8 | 33 |
| Long-context Language Understanding | LongBench v1 (test) | 2WQA Score | 40.63 | 14 |
| Numerical Stability Evaluation | Pretrained Models First Forward Pass | Max Scaled Logit | 5.60e+3 | 8 |
| Hardware Synthesis | 7nm process technology | Area (mm^2) | 4.36 | 8 |
| Question Answering | MMLU STEM (test) | Loss | 0.0116 | 3 |