
QTIP: Quantization with Trellises and Incoherence Processing

About

Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing weights to low-precision datatypes. Since LLM inference is usually memory-bound, PTQ methods can improve inference throughput. Recent state-of-the-art PTQ approaches use vector quantization (VQ) to quantize multiple weights at once, which improves information utilization through better shaping. However, VQ requires a codebook with size exponential in the dimension. This limits current VQ-based PTQ methods to low VQ dimensions ($\le 8$), which in turn limits quantization quality. Here, we introduce QTIP, which instead uses trellis coded quantization (TCQ) to achieve ultra-high-dimensional quantization. TCQ uses a stateful decoder that separates the codebook size from the bitrate and effective dimension. QTIP introduces a spectrum of lookup-only to computed lookup-free trellis codes designed for a hardware-efficient "bitshift" trellis structure; these codes achieve state-of-the-art results in both quantization quality and inference speed.
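To make the "bitshift" trellis idea concrete, here is a minimal decoding sketch. It is not the paper's implementation: the function name `decode_bitshift_trellis` and the random-Gaussian lookup codebook are illustrative stand-ins for QTIP's actual lookup/computed codes. The key property it shows is the one the abstract describes: the decoder's state is an `L`-bit sliding window over the bitstream, so each decoded weight costs only `k` fresh bits (the bitrate), while the effective codebook has $2^L$ entries.

```python
import numpy as np

def decode_bitshift_trellis(bits, L=16, k=2, codebook=None, seed=0):
    """Sketch of decoding with a "bitshift" trellis.

    The decoder state is the last L bits seen. Consecutive states overlap
    in L - k bits, so advancing to the next weight only shifts in k new
    bits -- k is the bitrate, while the state space (and hence effective
    codebook) has 2**L entries.
    """
    if codebook is None:
        # Hypothetical stand-in: a random Gaussian table indexed by the
        # L-bit state. QTIP's lookup-free variants instead compute the
        # value from the state, avoiding a large table.
        rng = np.random.default_rng(seed)
        codebook = rng.standard_normal(2 ** L).astype(np.float32)

    # Initialize the state with the first L bits.
    state, pos = 0, 0
    for _ in range(L):
        state = (state << 1) | int(bits[pos])
        pos += 1
    out = [codebook[state]]

    # Each subsequent weight: shift in k fresh bits, keep L bits of state.
    mask = (1 << L) - 1
    while pos + k <= len(bits):
        for _ in range(k):
            state = ((state << 1) | int(bits[pos])) & mask
            pos += 1
        out.append(codebook[state])
    return np.array(out, dtype=np.float32)
```

For example, a 34-bit stream with `L=16, k=2` decodes to 10 weights (16 bits of initial state, then 2 bits per additional weight), giving an amortized rate of roughly 2 bits per weight for long streams.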

Albert Tseng, Qingyao Sun, David Hou, Christopher De Sa • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText2 | Perplexity | 3.16 | 1875 |
| Language Modeling | WikiText-2 (test) | PPL | 2.75 | 1541 |
| Commonsense Reasoning | HellaSwag | Accuracy | 60.8 | 1460 |
| Language Modeling | C4 | Perplexity | 5 | 1182 |
| Language Modeling | WikiText-2 | Perplexity (PPL) | 5.11 | 841 |
| Reasoning | BBH | Accuracy | 36.27 | 507 |
| Language Modeling | C4 (val) | PPL | 5.83 | 392 |
| Language Modeling | WikiText2 v1 (test) | Perplexity | 1.79 | 341 |
| Instruction Following | IFEval | -- | -- | 292 |
| Language Modeling | C4 (test) | Perplexity | 5 | 268 |

Showing 10 of 36 rows.

Other info

Code
