ScaleBITS: Scalable Bitwidth Search for Hardware-Aligned Mixed-Precision LLMs
About
Post-training weight quantization is crucial for reducing the memory and inference cost of large language models (LLMs), yet pushing the average precision below 4 bits remains challenging due to highly non-uniform weight sensitivity and the lack of principled precision allocation. Existing solutions either use irregular, fine-grained mixed precision with high runtime overhead, or rely on heuristics and highly constrained precision-allocation strategies. In this work, we propose ScaleBITS, a mixed-precision quantization framework that enables automated, fine-grained bitwidth allocation under a memory budget while preserving hardware efficiency. Guided by a new sensitivity analysis, we introduce a hardware-aligned, block-wise weight partitioning scheme, powered by bi-directional channel reordering. We formulate global bitwidth allocation as a constrained optimization problem and develop a scalable approximation to the greedy algorithm, enabling end-to-end principled allocation. Experiments show that ScaleBITS significantly improves over uniform-precision quantization (up to +36%) and outperforms state-of-the-art sensitivity-aware baselines (up to +13%) in the ultra-low-bit regime, without adding runtime overhead.
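To make the budget-constrained allocation idea concrete, here is a minimal sketch of a greedy bitwidth allocator of the kind the abstract alludes to. All names (`greedy_allocate`, the per-block `sensitivity` table, the candidate `bit_options`) are illustrative assumptions, not the paper's API: each block starts at the lowest precision and is repeatedly upgraded wherever the sensitivity reduction per extra bit of memory is largest, until the budget is exhausted.

```python
import heapq

def greedy_allocate(sensitivity, sizes, bit_options, budget_bits):
    """Hypothetical greedy bitwidth allocation under a memory budget.

    sensitivity[b][i] -- quantization error of block b at bit_options[i]
                         (assumed non-increasing in i)
    sizes[b]          -- number of weights in block b
    bit_options       -- candidate bitwidths, ascending (e.g. [2, 3, 4])
    budget_bits       -- total memory budget in bits
    Returns one bitwidth per block whose total cost fits the budget.
    """
    n = len(sizes)
    alloc = [0] * n  # index into bit_options; start at lowest precision
    used = sum(sizes[b] * bit_options[0] for b in range(n))

    # Each block has one pending upgrade in the heap, keyed by
    # (error reduction) / (extra bits) so the best trade-off pops first.
    heap = []
    for b in range(n):
        gain = sensitivity[b][0] - sensitivity[b][1]
        cost = sizes[b] * (bit_options[1] - bit_options[0])
        heapq.heappush(heap, (-gain / cost, b))

    while heap:
        _, b = heapq.heappop(heap)
        i = alloc[b]
        cost = sizes[b] * (bit_options[i + 1] - bit_options[i])
        if used + cost > budget_bits:
            continue  # memory use only grows, so this upgrade is dead
        alloc[b] = i + 1
        used += cost
        if i + 2 < len(bit_options):  # queue this block's next upgrade
            gain = sensitivity[b][i + 1] - sensitivity[b][i + 2]
            nxt = sizes[b] * (bit_options[i + 2] - bit_options[i + 1])
            heapq.heappush(heap, (-gain / nxt, b))

    return [bit_options[i] for i in alloc]
```

The per-block sensitivities would come from the paper's sensitivity analysis; here they are just an input table. The paper's contribution is a scalable approximation of this exact-greedy scheme, which this sketch does not attempt to reproduce.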
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText-2 | Perplexity (PPL) | 3.69 | 841 |
| Mathematical Reasoning | GSM8K | Accuracy (GSM8K) | 72.18 | 358 |
| Multi-task Language Understanding | MMLU | Accuracy | 63.3 | 87 |
| Multi-task Language Understanding | MMLU (test) | Normalized Accuracy | 67.12 | 76 |
| Multi-task Language Understanding | MMLU | Accuracy (5-shot) | 76.88 | 31 |
| Zero-shot Classification | WinoGrande, PiQA, HellaSwag, ARC-easy, ARC-challenge, BoolQ (zero-shot) | Avg Zero-shot Acc | 75 | 31 |
| Zero-shot Evaluation | 6 zero-shot downstream tasks | Average Accuracy | 72.86 | 19 |
| Language Modeling | WikiText-2, context length 2048 (test) | Perplexity | 7.15 | 7 |
| Language Modeling | C4, context length 2048 (test) | Perplexity | 8.84 | 6 |
| Language Modeling | WikiText-2, context length 4096 (test) | PPL (WikiText-2) | 6.74 | 5 |