CodeGEMM: A Codebook-Centric Approach to Efficient GEMM in Quantized LLMs

About

Weight-only quantization is widely used to mitigate the memory-bound nature of LLM inference. Codebook-based methods extend this trend by achieving strong accuracy in the extremely low-bit regime (e.g., 2-bit). However, current kernels rely on dequantization, which repeatedly fetches centroids and reconstructs weights, incurring substantial latency and cache pressure. We present CodeGEMM, a codebook-centric GEMM kernel that replaces dequantization with precomputed inner products between centroids and activations stored in a lightweight Psumbook. At inference, code indices directly gather these partial sums, eliminating per-element lookups and reducing the on-chip footprint. The kernel supports the systematic exploration of latency-memory-accuracy trade-offs under a unified implementation. On Llama-3 models, CodeGEMM delivers 1.83x (8B) and 8.93x (70B) speedups in the 2-bit configuration compared to state-of-the-art codebook-based quantization at comparable accuracy and further improves computing efficiency and memory subsystem utilization.
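The contrast between dequantization and the Psumbook can be illustrated with a minimal NumPy sketch. It is not the CodeGEMM kernel: it assumes a single codebook, no per-group scales, and a matrix-vector product on the CPU, and every name in it (`dequant_matvec`, `psumbook_matvec`, `codes`, `codebook`) is hypothetical. It only shows that gathering precomputed centroid-activation inner products by code index gives the same result as rebuilding the weights first.

```python
import numpy as np

def dequant_matvec(codes, codebook, x):
    """Baseline: rebuild every weight row from its centroids, then multiply."""
    out_features = codes.shape[0]
    # codebook[codes] has shape (out_features, n_groups, v); flatten to rows.
    w_hat = codebook[codes].reshape(out_features, -1)
    return w_hat @ x

def psumbook_matvec(codes, codebook, x):
    """Psumbook-style: precompute centroid/activation dot products, then gather."""
    n_centroids, v = codebook.shape
    x_groups = x.reshape(-1, v)                  # (n_groups, v)
    # psumbook[g, k] = <codebook[k], x_groups[g]>, computed once per activation.
    psumbook = x_groups @ codebook.T             # (n_groups, n_centroids)
    group_ids = np.arange(x_groups.shape[0])     # broadcasts against codes
    gathered = psumbook[group_ids, codes]        # (out_features, n_groups)
    return gathered.sum(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    out_features, n_groups, v, n_centroids = 6, 4, 8, 16   # illustrative sizes
    codebook = rng.standard_normal((n_centroids, v))
    codes = rng.integers(0, n_centroids, size=(out_features, n_groups))
    x = rng.standard_normal(n_groups * v)
    print(np.allclose(dequant_matvec(codes, codebook, x),
                      psumbook_matvec(codes, codebook, x)))
```

In this sketch the baseline touches `out_features * n_groups * v` reconstructed weights, while the Psumbook path computes `n_groups * n_centroids` inner products once and then performs one gather and add per code index. The sizes above are arbitrary and say nothing about the reported speedups.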

Gunho Park, Jeongin Bae, Byeongwook Kim, Baeseong Park, Jiwon Ryu, Hoseung Kim, Se Jung Kwon, Dongsoo Lee • 2025

Related benchmarks

Task	Dataset	Result
Matrix Multiplication Latency	Synthetic Matrix Multiplication Shapes	Latency (µs): 20.66	198
Language Understanding	MMLU 5-shot (test)	Accuracy: 57.42	149
Question Answering	ARC-Challenge 0-shot (test)	Accuracy: 47.7	48
Common Sense Reasoning	HellaSwag 0-shot	Accuracy: 73.85	38
Question Answering	ARC-E 0-shot	Accuracy: 73.91	37
Linear Layer Latency Inference	Llama-3-8B decoder block	Latency (µs): 153	36
Language Understanding	Llama-3.1-70B Evaluation Suite (MMLU, WinoGrande, HellaSwag, ARC-Easy, ARC-Challenge)	MMLU: 71.21	13
Commonsense Reasoning	WinoGrande 0-shot (test)	Accuracy: 69.06	10
Matrix Multiplication Latency	Llama-3-8B	Kernel-level latency (µs): 152.7	8
Matrix Multiplication Latency	Llama-3 70B	Kernel latency (µs): 293.8	8

Showing 10 of 11 rows
