OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension

About

While 4-bit quantization is essential for high-throughput deployment of Large Language Models, activation outliers often cause significant accuracy degradation because of the restricted dynamic range of low-bit formats. In this paper, we systematically investigate the spatial distribution of outliers and demonstrate a token-persistent structural clustering effect: high-magnitude outliers consistently occupy the same fixed channels across tokens. Building on this insight, we propose OSC, a hardware-efficient framework for outlier suppression. During inference, OSC executes a dual-path computation consisting of a low-precision 4-bit General Matrix Multiplication (GEMM) path and a high-precision 16-bit branch GEMM path. Specifically, OSC uses an offline group-wise strategy to identify the channels where outliers are located, then performs structured sub-tensor extraction online to coalesce these scattered activation channels into a compact dense tensor. This mechanism protects outliers through regular, high-throughput GEMM operations and fits seamlessly with modern 4-bit micro-scaling hardware. Furthermore, for the inputs of W2, where outlier clustering is less pronounced, we integrate a fallback strategy to FP8. On Qwen3-8B and Qwen3-30B, OSC limits the average accuracy drop to 2.19 and 1.12 points, respectively. Notably, OSC is highly hardware-friendly, achieving a peak speedup of 1.78x over the W8A8 GEMM baseline on a modern AI accelerator.
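The dual-path idea can be illustrated with a small, self-contained Python sketch. It is not the authors' implementation. The selection rule (top-k channels by peak magnitude over calibration tokens, standing in for the paper's offline group-wise strategy), the symmetric INT4 fake quantizer with one absmax scale per group, the group size of 4, and all function names are hypothetical choices made for illustration, and Python floats stand in for the 16-bit branch. Real 4-bit micro-scaling formats such as MXFP4 instead use FP4 elements with a shared power-of-two scale per 32-element block, and the FP8 fallback for W2 is not modeled here.

```python
"""Toy dual-path W4A4 GEMM with outlier-channel separation (illustrative only).

Hypothetical helpers, not the OSC implementation. Assumptions: symmetric INT4
fake quantization with one absmax scale per group, top-k outlier selection by
calibration absmax, and Python floats standing in for the 16-bit branch.
"""
from typing import List

Matrix = List[List[float]]


def find_outlier_channels(calib: Matrix, k: int) -> List[int]:
    """Offline: rank input channels by peak |activation| over calibration tokens."""
    peak = [max(abs(row[c]) for row in calib) for c in range(len(calib[0]))]
    ranked = sorted(range(len(peak)), key=lambda c: peak[c], reverse=True)
    return sorted(ranked[:k])


def quant_int4(vec: List[float], group: int) -> List[float]:
    """Fake-quantize to signed INT4 [-8, 7], one absmax scale per `group` values."""
    out: List[float] = []
    for start in range(0, len(vec), group):
        chunk = vec[start:start + group]
        scale = max(abs(v) for v in chunk) / 7 or 1.0  # avoid a zero scale
        out.extend(max(-8, min(7, round(v / scale))) * scale for v in chunk)
    return out


def gather_cols(mat: Matrix, cols: List[int]) -> Matrix:
    """Structured sub-tensor extraction: pack the chosen columns into a dense matrix."""
    return [[row[c] for c in cols] for row in mat]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Reference GEMM: (M x K) @ (K x N)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def dual_path_gemm(x: Matrix, w: Matrix, outliers: List[int], group: int = 4) -> Matrix:
    """Approximate x @ w: INT4 path on normal channels plus high-precision outlier branch."""
    skip = set(outliers)
    normal = [c for c in range(len(w)) if c not in skip]
    # Low-precision path: quantize group-wise along K, per token row and per weight column.
    x_lo = [quant_int4(row, group) for row in gather_cols(x, normal)]
    w_cols = [quant_int4(list(col), group) for col in zip(*(w[c] for c in normal))]
    y_lo = matmul(x_lo, [list(r) for r in zip(*w_cols)])
    if not outliers:
        return y_lo  # degenerates to plain all-INT4 W4A4
    # High-precision branch: dense GEMM over the coalesced outlier channels only.
    y_hi = matmul(gather_cols(x, outliers), [w[c] for c in outliers])
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(y_lo, y_hi)]


if __name__ == "__main__":
    import random

    random.seed(0)
    K, N, OUTLIER_CH = 16, 4, {3, 11}  # hypothetical sizes and outlier channels

    def tokens(m: int) -> Matrix:
        # Outlier channels are large on every token (token-persistent clustering).
        return [[random.gauss(0, 1) * (50 if c in OUTLIER_CH else 1) for c in range(K)]
                for _ in range(m)]

    calib, x = tokens(64), tokens(8)
    w = [[random.gauss(0, 0.1) for _ in range(N)] for _ in range(K)]
    chosen = find_outlier_channels(calib, k=2)
    exact = matmul(x, w)

    def err(y: Matrix) -> float:
        return max(abs(a - b) for ra, rb in zip(y, exact) for a, b in zip(ra, rb))

    print("outlier channels:", chosen)
    print("all-INT4 max error: ", err(dual_path_gemm(x, w, [])))
    print("dual-path max error:", err(dual_path_gemm(x, w, chosen)))
```

Because the outlier channels are fixed across tokens, the channel list is computed once offline and reused for every batch; at run time the only extra work is a column gather plus one small dense GEMM whose inner dimension equals the number of outlier channels, so both paths stay on regular GEMM kernels. Calling dual_path_gemm with an empty outlier list reduces it to plain all-INT4 W4A4, which gives a quick baseline to compare against.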

Zhiyuan Zhang, Yanzhao Li, Zhiqiang Zou, Bai Du, Yupeng Sun, Hui Dong, Hui Wang • 2026

Related benchmarks

Task                    Dataset           Metric              Result   Rank
Question Answering      ARC Easy          Accuracy            89.98    597
Knowledge Evaluation    MMLU              MMLU Accuracy       79.28    26
Mathematical Reasoning  GSM8K             Accuracy (GSM8K)    89.01    14
Question Answering      ARC Challenge     Accuracy            72.61    14
Hardware Efficiency     GEMM M = 16-64    Speedup             1.72x    3
Hardware Efficiency     GEMM M ≥ 128      Speedup             1.88x    3