QuEPT: Quantized Elastic Precision Transformers with One-Shot Calibration for Multi-Bit Switching
About
Elastic precision quantization enables multi-bit deployment via a single optimization pass, fitting diverse quantization scenarios. Yet, due to the high storage and optimization costs associated with the Transformer architecture, research on elastic quantization remains limited, particularly for large language models. This paper proposes QuEPT, an efficient post-training scheme that reconstructs block-wise multi-bit errors with one-shot calibration on a small data slice. It can dynamically adapt to various predefined bit-widths by cascading different low-rank adapters, and it supports real-time switching between uniform quantization and mixed-precision quantization without repeated optimization. To enhance accuracy and robustness, we introduce Multi-Bit Token Merging (MB-ToMe), which dynamically fuses token features across different bit-widths, improving robustness during bit-width switching. Additionally, we propose Multi-Bit Cascaded Low-Rank adapters (MB-CLoRA) to strengthen correlations between bit-width groups, further improving the overall performance of QuEPT. Extensive experiments demonstrate that QuEPT achieves comparable or better performance relative to existing state-of-the-art post-training quantization methods. Our code is available at https://github.com/xuke225/QuEPT
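The idea of cascading low-rank adapters across bit-widths can be sketched in a few lines. The snippet below is a minimal illustration, not the QuEPT implementation: it quantizes a weight matrix at the lowest precision, then fits one low-rank adapter per higher bit-width to the residual error via truncated SVD (a stand-in for the paper's block-wise calibration; the rank `r = 4` and the bit-width set are arbitrary choices here). Switching precision at run time then amounts to summing the adapters up to the target bit-width, with no re-optimization.

```python
import numpy as np

def quantize(w, bits):
    # Symmetric uniform quantization to the given bit-width.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))

# One shared quantized backbone at the lowest precision...
W_q = quantize(W, 2)

# ...plus one low-rank adapter per additional bit-width, fitted to
# the residual error by truncated SVD (hypothetical calibration).
adapters = {}
approx = W_q
for bits in (3, 4, 8):
    residual = quantize(W, bits) - approx
    U, S, Vt = np.linalg.svd(residual, full_matrices=False)
    r = 4  # adapter rank (arbitrary for this sketch)
    A, B = U[:, :r] * S[:r], Vt[:r]
    adapters[bits] = (A, B)
    # Cascade: the next bit-width builds on all previous adapters.
    approx = approx + A @ B

def effective_weight(target_bits):
    # Switch precision at run time by summing cascaded adapters.
    w = W_q.copy()
    for bits, (A, B) in adapters.items():
        if bits <= target_bits:
            w = w + A @ B
    return w
```

Because each adapter only stores two rank-`r` factors, the per-bit-width overhead stays small, and higher-precision modes reuse the corrections already applied at lower precision.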
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText2 | Perplexity | 4.94 | 1875 |
| Language Modeling | C4 | Perplexity | 6.53 | 1182 |
| Visual Question Answering | TextVQA (test) | Accuracy | 74.1 | 124 |
| Multimodal Understanding | MMMU (test) | -- | -- | 86 |
| Visual Question Answering | VizWiz (test) | Accuracy | 61.3 | 66 |
| Optical Character Recognition | OCRBench (test) | Score | 61.2 | 34 |
| General Language Evaluation | 5 Datasets Zero-shot | Average Accuracy | 72.87 | 33 |
| Image Classification | ImageNet-1k (val) | Top-1 Acc (W6A6) | 83.8 | 23 |