FP8 Formats for Deep Learning
About
FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for representation of special values, E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit pattern for NaNs. We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions. Our study covers the main modern neural network architectures - CNNs, RNNs, and Transformer-based models - leaving all the hyperparameters unchanged from the 16-bit baseline training sessions. Our training experiments include large language models of up to 175B parameters. We also examine FP8 post-training quantization of language models, trained using 16-bit formats, that resisted fixed-point int8 quantization.
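The two encodings differ only in how they split the 7 non-sign bits and in how they handle special values. A minimal sketch of a decoder for both, assuming the biases implied by the formats (7 for E4M3, 15 for E5M2) and the paper's convention that E4M3 reserves only the all-ones exponent with all-ones mantissa for NaN:

```python
import math

def decode_fp8(byte: int, fmt: str = "e4m3") -> float:
    """Decode one FP8 byte to a Python float.

    fmt="e4m3": 1 sign, 4 exponent, 3 mantissa bits, bias 7. No infinities;
                only the S.1111.111 pattern is NaN, which extends the finite
                range to +/-448.
    fmt="e5m2": 1 sign, 5 exponent, 2 mantissa bits, bias 15. IEEE-754-style
                specials: all-ones exponent encodes inf (mantissa 0) or NaN.
    """
    if fmt == "e4m3":
        exp_bits, man_bits, bias = 4, 3, 7
    elif fmt == "e5m2":
        exp_bits, man_bits, bias = 5, 2, 15
    else:
        raise ValueError(f"unknown format: {fmt}")

    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_all_ones = (1 << exp_bits) - 1

    if fmt == "e5m2" and exp == exp_all_ones:
        return sign * math.inf if man == 0 else math.nan
    if fmt == "e4m3" and exp == exp_all_ones and man == (1 << man_bits) - 1:
        return math.nan  # the single NaN mantissa pattern per sign
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 1 - bias.
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)
```

For example, the largest finite E4M3 value, `0b0_1111_110`, decodes to 448, whereas the same exponent pattern in E5M2 (`0b0_11111_00`) is infinity; the largest finite E5M2 value is 57344.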
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy | 61.7 | 1362 |
| Multi-task Language Understanding | MMLU | Accuracy | 64.5 | 321 |
| Document Question Answering | Qasper | Accuracy | 40.8 | 44 |
| Variable Tracking | RULER-VT | Accuracy | 99.9 | 33 |
| Key-Value Retrieval | LITM (Lost in the Middle) | Accuracy | 99.8 | 33 |
| Long-context Language Understanding | LongBench v1 (test) | 2WQA Score | 40.63 | 14 |
| Numerical Stability Evaluation | Pretrained Models First Forward Pass | Max Scaled Logit | 5.60e+3 | 8 |
| Hardware Synthesis | 7nm process technology | Area (mm^2) | 4.36 | 8 |
| Question Answering | MMLU STEM (test) | Loss | 0.0116 | 3 |