MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Any-Precision LLM

About

Dynamic runtime latency and memory constraints necessitate flexible large language model (LLM) deployment, where an LLM can run inference at various quantization precisions depending on the available computational resources. Recent work on such any-precision quantization either relies on hardware-inefficient vector quantization or introduces additional scaling factors when switching between bit-widths. Meanwhile, existing post-training quantization (PTQ) methods calibrated for a fixed low precision generalize poorly when the precision changes at runtime. In this work, we attribute this poor generalization across bit-widths to a precision-dependent "outlier migration" phenomenon, in which the distribution of PTQ-sensitive tokens changes across precisions. Motivated by this observation, we propose MoBiQuant, a novel any-precision Mixture-of-Bits quantization framework that adjusts weight precision for flexible LLM inference based on token sensitivity. Specifically, we propose a many-in-one recursive residual quantization that iteratively reconstructs higher-precision weights at runtime, and we mitigate outlier migration with a token-aware router that dynamically selects the optimal inference precision for each token. Extensive experiments show that MoBiQuant matches or surpasses frontier single-precision PTQ while exhibiting strong elasticity, achieving significant memory savings and throughput gains of up to 1.34× over state-of-the-art any-precision methods.
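
The idea of recursive residual quantization with per-token precision selection can be illustrated with a toy sketch. This is not the authors' implementation: the symmetric uniform quantizer, the function names (quantize_uniform, residual_slices, reconstruct, choose_num_slices), the fixed-threshold "router", and the sensitivity score are all assumptions made for illustration, and the actual MoBiQuant quantizer and router may differ substantially. The sketch only shows how one set of low-bit slices could be summed to recover progressively more precise weights.

```python
def quantize_uniform(values, bits):
    """Symmetric uniform quantization (assumed scheme); returns dequantized values."""
    if bits < 2:
        raise ValueError("this toy quantizer needs at least 2 bits")
    levels = 2 ** (bits - 1) - 1            # largest integer magnitude
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / levels
    return [round(v / scale) * scale for v in values]


def residual_slices(weights, bits_per_slice, num_slices):
    """Quantize the weights, then repeatedly quantize what is left over."""
    slices = []
    residual = list(weights)
    for _ in range(num_slices):
        q = quantize_uniform(residual, bits_per_slice)
        slices.append(q)
        residual = [r - x for r, x in zip(residual, q)]
    return slices


def reconstruct(slices, k):
    """Approximate the weights by summing the first k slices (k >= 1)."""
    return [sum(column) for column in zip(*slices[:k])]


def choose_num_slices(sensitivity, thresholds):
    """Hypothetical stand-in for a router: more sensitive tokens get more slices."""
    return 1 + sum(1 for t in thresholds if sensitivity > t)


def max_error(reference, approx):
    return max(abs(a - b) for a, b in zip(reference, approx))


weights = [0.9, -0.35, 0.12, -0.71, 0.05, 0.48]
slices = residual_slices(weights, bits_per_slice=2, num_slices=4)
errors = [max_error(weights, reconstruct(slices, k)) for k in range(1, 5)]

for k, err in enumerate(errors, start=1):
    print(f"slices used: {k}, max abs error: {err:.4f}")

# A made-up sensitivity score of 0.7 exceeds both thresholds, so this token
# would be served with three slices in this toy setup.
print(choose_num_slices(0.7, thresholds=[0.3, 0.6]))
```

In this sketch, every precision level shares the same stored slices, so raising precision only means adding one more slice to the sum. A learned router would replace the fixed thresholds used here.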

Dongwei Wang, Jinhee Kim, Seokho Han, Denis Gudovskiy, Yohei Nakata, Tomoyuki Okuno, KhayTze Peong, Kang Eun Jeon, Jong Hwan Ko, Yiran Chen, Huanrui Yang • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Language Modeling	WikiText2	Perplexity	5.08	3785
Language Modeling	WikiText-2	Perplexity (PPL)	16.909	2320
Commonsense Reasoning	HellaSwag	Accuracy	40.2	1896
Language Modeling	C4	Perplexity	24.849	1688
Commonsense Reasoning	WinoGrande	Accuracy	55.6	1442
Language Modeling	PTB	Perplexity	29.268	1234
Question Answering	ARC Challenge	Accuracy	28.4	906
Question Answering	BoolQ	Accuracy	61.9	317
Common Sense Reasoning	Common Sense Reasoning Tasks (ARC-C, ARC-E, BoolQ, HellaSwag, PIQA, WinoGrande) zero-shot	Average Accuracy (Zero-Shot)	71.84	92
Question Answering	ARC Easy	Normalized Accuracy	53.2	55

