
ScaleBITS: Scalable Bitwidth Search for Hardware-Aligned Mixed-Precision LLMs

About

Post-training weight quantization is crucial for reducing the memory and inference cost of large language models (LLMs), yet pushing the average precision below 4 bits remains challenging due to highly non-uniform weight sensitivity and the lack of principled precision allocation. Existing solutions either use irregular, fine-grained mixed-precision formats that incur high runtime overhead, or rely on heuristics and highly constrained precision-allocation strategies. In this work, we propose ScaleBITS, a mixed-precision quantization framework that enables automated, fine-grained bitwidth allocation under a memory budget while preserving hardware efficiency. Guided by a new sensitivity analysis, we introduce a hardware-aligned, block-wise weight partitioning scheme, powered by bi-directional channel reordering. We formulate global bitwidth allocation as a constrained optimization problem and develop a scalable approximation to the greedy algorithm, enabling end-to-end principled allocation. Experiments show that ScaleBITS significantly improves over uniform-precision quantization (up to +36%) and outperforms state-of-the-art sensitivity-aware baselines (up to +13%) in the ultra-low-bit regime, without adding runtime overhead.
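The abstract formulates global bitwidth allocation as a constrained optimization problem solved by a scalable approximation to a greedy algorithm. The paper's exact algorithm is not reproduced here, but the general idea of greedy precision allocation under a bit budget can be sketched as below. Everything in this snippet (the function name, the per-block sensitivity table, and the upgrade rule of "largest error reduction per extra bit") is an illustrative assumption, not the authors' implementation:

```python
import heapq

def greedy_bitwidth_allocation(sensitivities, block_size, budget_bits,
                               choices=(2, 3, 4)):
    """Illustrative greedy bit allocation (NOT the ScaleBITS algorithm).

    sensitivities[i][b] : estimated quantization error of block i at bitwidth b
    block_size          : number of weights per block
    budget_bits         : total bit budget across all blocks

    Start every block at the lowest bitwidth, then repeatedly upgrade the
    block whose next upgrade yields the largest error reduction per extra
    bit, until the budget is exhausted.
    """
    n = len(sensitivities)
    alloc = [choices[0]] * n                 # everyone starts at lowest precision
    used = n * block_size * choices[0]       # bits consumed so far

    def gain(i, cur):
        # Benefit-per-bit of moving block i from `cur` up to the next bitwidth.
        nxt = choices[choices.index(cur) + 1]
        err_drop = sensitivities[i][cur] - sensitivities[i][nxt]
        return err_drop / ((nxt - cur) * block_size), nxt

    heap = []                                # max-heap via negated gains
    for i in range(n):
        g, nxt = heapq.heappush(heap, (-gain(i, alloc[i])[0], i,
                                       alloc[i], gain(i, alloc[i])[1])) or (None, None)

    while heap:
        neg_g, i, cur, nxt = heapq.heappop(heap)
        if alloc[i] != cur:
            continue                         # stale entry: block already upgraded
        cost = (nxt - cur) * block_size
        if used + cost > budget_bits:
            continue                         # this upgrade would bust the budget
        alloc[i], used = nxt, used + cost
        if nxt != choices[-1]:               # queue the block's next upgrade step
            g, nxt2 = gain(i, nxt)
            heapq.heappush(heap, (-g, i, nxt, nxt2))
    return alloc
```

A toy usage: two 4-weight blocks, one highly sensitive and one not, with an average budget of 3 bits per weight (24 bits total), should spend the extra bits on the sensitive block.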

Xinlin Li, Timothy Chou, Josh Fromm, Zichang Liu, Yunjie Pan, Christina Fragouli • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Language Modeling | WikiText-2 | Perplexity (PPL) | 3.69 | 841 |
| Mathematical Reasoning | GSM8K | Accuracy (GSM8K) | 72.18 | 358 |
| Multi-task Language Understanding | MMLU | Accuracy | 63.3 | 87 |
| Multi-task Language Understanding | MMLU (test) | Normalized Accuracy | 67.12 | 76 |
| Multi-task Language Understanding | MMLU | Accuracy (5-shot) | 76.88 | 31 |
| Zero-shot Classification | WinoGrande, PiQA, HellaSwag, ARC-easy, ARC-challenge, BoolQ (zero-shot) | Avg Zero-shot Acc | 75 | 31 |
| Zero-shot Evaluation | 6 zero-shot downstream tasks | Average Accuracy | 72.86 | 19 |
| Language Modeling | WikiText-2, context length 2048 (test) | Perplexity | 7.15 | 7 |
| Language Modeling | C4, context length 2048 (test) | Perplexity | 8.84 | 6 |
| Language Modeling | WikiText-2, context length 4096 (test) | PPL (WikiText-2) | 6.74 | 5 |

Showing 10 of 11 rows.
