
MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Any-Precision LLM

About

Dynamic runtime latency and memory constraints call for flexible large language model (LLM) deployment, in which an LLM can run inference at different quantization precisions depending on the available computational resources. Recent work on such any-precision quantization either relies on hardware-inefficient vector quantization or introduces additional scaling factors when switching between bit-widths. Meanwhile, existing post-training quantization (PTQ) methods calibrated for a fixed low precision generalize poorly when the precision changes at runtime. In this work, we attribute this poor generalization across bit-widths to a precision-dependent outlier migration phenomenon, in which the distribution of PTQ-sensitive tokens changes across precisions. Motivated by this observation, we propose MoBiQuant, a novel any-precision Mixture-of-Bits quantization framework that adjusts weight precision for flexible LLM inference based on token sensitivity. Specifically, we propose a many-in-one recursive residual quantization that can iteratively reconstruct higher-precision weights at runtime, together with a token-aware router that mitigates outlier migration by dynamically selecting the optimal inference precision for each token. Extensive experiments show that MoBiQuant matches or surpasses frontier single-precision PTQ while exhibiting strong elasticity, achieving significant memory savings and throughput gains of up to 1.34× over state-of-the-art any-precision methods.

Dongwei Wang, Jinhee Kim, Seokho Han, Denis Gudovskiy, Yohei Nakata, Tomoyuki Okuno, KhayTze Peong, Kang Eun Jeon, Jong Hwan Ko, Yiran Chen, Huanrui Yang • 2026
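
The recursive residual quantization mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the symmetric uniform quantizer, the 2-bit slice width, and the function names (`quantize_uniform`, `residual_slices`, `reconstruct`, `token_adaptive_matmul`) are all assumptions made for illustration. The sketch only shows the general idea that a weight matrix can be stored as a stack of low-bit residual slices, that summing more slices yields a higher-precision reconstruction, and that each token can use a different number of slices. The router that would choose the per-token precision is not modeled; the slice counts are passed in by hand.

```python
import numpy as np


def quantize_uniform(x, bits):
    """Symmetric uniform quantization (assumed scheme); returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(x))
    # Guard against an all-zero input so the scale is never zero.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale


def residual_slices(w, bits_per_slice, num_slices):
    """Quantize w, then repeatedly quantize whatever error is left over."""
    slices = []
    residual = w.copy()
    for _ in range(num_slices):
        s = quantize_uniform(residual, bits_per_slice)
        slices.append(s)
        residual = residual - s
    return slices


def reconstruct(slices, k):
    """Approximate the original weights from the first k slices (k >= 1)."""
    return np.sum(slices[:k], axis=0)


def token_adaptive_matmul(x, slices, k_per_token):
    """Multiply each token by weights rebuilt from its own number of slices.

    k_per_token is supplied by the caller here; it stands in for the output
    of a router, which this sketch does not implement.
    """
    out = np.empty((x.shape[0], slices[0].shape[1]))
    for i, k in enumerate(k_per_token):
        out[i] = x[i] @ reconstruct(slices, k)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((8, 4))       # toy weight matrix (d_in, d_out)
    slices = residual_slices(w, bits_per_slice=2, num_slices=4)
    for k in range(1, 5):
        err = np.max(np.abs(w - reconstruct(slices, k)))
        print(f"slices used: {k}, max abs error: {err:.4f}")
    x = rng.standard_normal((3, 8))       # three toy tokens
    print(token_adaptive_matmul(x, slices, [1, 2, 4]).shape)
```

In this toy setup the maximum reconstruction error shrinks as more slices are summed, which is the property an any-precision scheme relies on: one stored model can serve several bit-widths without separate calibration per precision.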

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText2 | Perplexity | 5.08 | 3785 |
| Language Modeling | WikiText-2 | Perplexity (PPL) | 16.909 | 2320 |
| Commonsense Reasoning | HellaSwag | Accuracy | 40.2 | 1896 |
| Language Modeling | C4 | Perplexity | 24.849 | 1688 |
| Commonsense Reasoning | WinoGrande | Accuracy | 55.6 | 1442 |
| Language Modeling | PTB | Perplexity | 29.268 | 1234 |
| Question Answering | ARC Challenge | Accuracy | 28.4 | 906 |
| Question Answering | BoolQ | Accuracy | 61.9 | 317 |
| Common Sense Reasoning | Common Sense Reasoning Tasks (ARC-C, ARC-E, BoolQ, HellaSwag, PIQA, WinoGrande) zero-shot | Average Accuracy (Zero-Shot) | 71.84 | 92 |
| Question Answering | ARC Easy | Normalized Accuracy | 53.2 | 55 |
