Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats

About

As LLMs scale, low-bit floating-point formats like MXFP and NVFP4 offer new opportunities for precision and efficiency. In this work, we evaluate HiFloat (HiF8 and HiF4), a family of formats tailored for Ascend NPUs. Through rigorous comparison across weight-activation and KV-cache tasks, we provide three key insights: (1) INT8 suits narrow-range data, while floating-point formats excel with high-variance data; (2) in 4-bit regimes, HiF4's hierarchical scaling prevents the accuracy collapse seen in integer formats; and (3) HiFloat is fully compatible with state-of-the-art post-training quantization frameworks. Overall, HiFloat provides a solution for high-efficiency LLM inference on NPUs.

Pengxiang Zhao, Hui-Ling Zhen, Xing Li, Han Bao, Weizhe Lin, Zhiyuan Yang, Manyi Zhang, Yuanyong Luo, Ziwei Yu, Xin Wang, Mingxuan Yuan, Xianzhi Yu, Zhenhua Dong• 2026

Related benchmarks

Task	Dataset	Result
Language Modeling	WikiText-2	Perplexity (PPL)34.89	2320
Commonsense Reasoning	HellaSwag	Accuracy57.3	1896
Language Modeling	C4	Perplexity15.28	1688
Language Modeling	WikiText	PPL9.79	740
Long-context language modeling	LongBench	Average Score34.55	328
Multi-task Language Understanding	MMLU	Accuracy72.9	136
Reasoning	ARC Challenge	Accuracy34	100
Model Evaluation Summary	Overall Aggregate	Average Score1.003	22
Quantization Performance Summary	Aggregated Benchmarks HellaSwag, MMLU, Arc-C, MATH-500	Average Score1.014	22
Quantization Robustness Evaluation	Average across Wikitext, C4, HellaSwag, MMLU, Arc-C, MATH500, and GSM8K	Accuracy Loss Delta (%)-0.29	5

Showing 10 of 10 rows

Other info

Follow for update

@wizwand_team Discord