
CodeGEMM: A Codebook-Centric Approach to Efficient GEMM in Quantized LLMs

About

Weight-only quantization is widely used to mitigate the memory-bound nature of LLM inference. Codebook-based methods extend this trend by achieving strong accuracy in the extremely low-bit regime (e.g., 2-bit). However, current kernels rely on dequantization, which repeatedly fetches centroids and reconstructs weights, incurring substantial latency and cache pressure. We present CodeGEMM, a codebook-centric GEMM kernel that replaces dequantization with precomputed inner products between centroids and activations stored in a lightweight Psumbook. At inference, code indices directly gather these partial sums, eliminating per-element lookups and reducing the on-chip footprint. The kernel supports the systematic exploration of latency-memory-accuracy trade-offs under a unified implementation. On Llama-3 models, CodeGEMM delivers 1.83x (8B) and 8.93x (70B) speedups in the 2-bit configuration compared to state-of-the-art codebook-based quantization at comparable accuracy and further improves computing efficiency and memory subsystem utilization.
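The Psumbook idea described above can be illustrated with a small NumPy sketch. This is not the authors' kernel. It assumes a single codebook, no scaling factors, and a single activation vector, and the names `dequant_matvec`, `psumbook_matvec`, `codes`, and `codebook` are hypothetical. It only shows that gathering precomputed centroid-activation inner products gives the same result as reconstructing the weights first.

```python
import numpy as np

def dequant_matvec(codes, codebook, x):
    """Baseline: rebuild the weight matrix from centroids, then multiply.

    codes:    (out_features, num_groups) integer centroid indices
    codebook: (num_centroids, vec_len) centroid vectors
    x:        (num_groups * vec_len,) activation vector
    """
    out_features = codes.shape[0]
    # Look up one centroid per code, then flatten groups back into rows.
    w_hat = codebook[codes].reshape(out_features, -1)
    return w_hat @ x

def psumbook_matvec(codes, codebook, x):
    """Codebook-centric path: precompute partial sums, then gather them."""
    num_groups = codes.shape[1]
    vec_len = codebook.shape[1]
    # Split the activation into the same groups the weights were coded in.
    x_groups = x.reshape(num_groups, vec_len)
    # psumbook[g, k] = <centroid k, activation group g>
    psumbook = x_groups @ codebook.T
    # For each output row, gather the partial sum its code selects per group.
    gathered = psumbook[np.arange(num_groups), codes]
    return gathered.sum(axis=1)

rng = np.random.default_rng(0)
codebook = rng.standard_normal((16, 4))      # 16 centroids of length 4
codes = rng.integers(0, 16, size=(8, 6))     # 8 output rows, 6 groups each
x = rng.standard_normal(24)                  # 6 groups * 4 elements

same = np.allclose(dequant_matvec(codes, codebook, x),
                   psumbook_matvec(codes, codebook, x))
print(same)
```

In this sketch the Psumbook has `num_groups * num_centroids` entries and is shared by every output row, so the per-row work reduces to index gathers and additions rather than centroid fetches and multiplications.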

Gunho Park, Jeongin Bae, Byeongwook Kim, Baeseong Park, Jiwon Ryu, Hoseung Kim, Se Jung Kwon, Dongsoo Lee • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Matrix Multiplication Latency | Synthetic Matrix Multiplication Shapes | Latency (µs): 20.66 | 189 |
| Language Understanding | MMLU 5-shot (test) | Accuracy: 57.42 | 149 |
| Linear Layer Latency Inference | Llama-3-8B decoder block | Latency (µs): 153 | 36 |
| Question Answering | ARC-E 0-shot | Accuracy: 73.91 | 29 |
| Common Sense Reasoning | HellaSwag 0-shot | Accuracy: 73.85 | 22 |
| Question Answering | ARC-Challenge 0-shot (test) | Accuracy: 47.7 | 10 |
| Commonsense Reasoning | WinoGrande 0-shot (test) | Accuracy: 69.06 | 10 |
| Matrix Multiplication Latency | Llama-3-8B | Kernel-level latency (µs): 152.7 | 8 |
| Matrix Multiplication Latency | Llama-3 70B | Kernel Latency (µs): 293.8 | 8 |
| Language Understanding | Llama-3.1-70B Evaluation Suite (MMLU, WinoGrande, HellaSwag, ARC-Easy, ARC-Challenge) | MMLU: 71.21 | 7 |
Showing 10 of 11 rows
