
BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization

About

Trellis-coded quantization sets the current 2-bit post-training frontier for LLMs (QTIP), but pushing below the PTQ ceiling requires quantization-aware training (QAT), and QAT on a trellis is obstructed by the non-differentiable Viterbi argmax. We introduce BCJR-QAT, a relaxation that replaces the argmax with the BCJR forward-backward sum-product algorithm at temperature $T$. The resulting soft codeword is the Boltzmann expectation over trellis paths: it is exactly differentiable, recovers the hard QTIP code as $T \to 0$, and is mathematically identical to the transfer-matrix computation for a 1D Ising-like spin chain. We contribute (i) a fused Triton kernel that makes BCJR tractable on a single consumer GPU ($6.57\times$ speedup, fp32 parity); (ii) a quantitative drift-budget theory of when BCJR-QAT can escape the QTIP-PTQ Voronoi basin, verified across four experiments; and (iii) a positive empirical result on Llama-3.2-1B at 2 bits per weight (bpw) under end-to-end forward-KL distillation. With the right schedule (skipping the high-$T$ phase to avoid an overshoot we diagnose), single-layer BCJR-QAT changes WikiText-2 perplexity by $\mathbf{-0.084}$ relative to QTIP-PTQ, and multi-layer compounding is super-additive.
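To make the relaxation concrete, the following is a minimal sketch of a temperature-$T$ forward-backward pass on a toy trellis, not the paper's Triton kernel. Everything specific here is an assumption for illustration: the function names `successors`, `soft_codeword`, and `hard_codeword`, the 4-state shift-register trellis, the scalar codebook, the squared-error path cost, and the free initial state. The soft codeword at position $t$ is computed as $\sum_s p_T(s_t = s)\, c_s$, where $p_T$ is the posterior marginal under path weights $\exp(-\text{cost}/T)$ and $c_s$ is the codebook value of state $s$.

```python
import math

def logsumexp(xs):
    # Numerically stable log(sum(exp(x))) for a non-empty list.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def successors(S):
    # Hypothetical toy trellis: state s may move to (2*s + b) % S, b in {0, 1}.
    return [[(2 * s + b) % S for b in (0, 1)] for s in range(S)]

def soft_codeword(w, codebook, T):
    """Boltzmann-expected reconstruction of w over all trellis paths."""
    S, n = len(codebook), len(w)
    succ = successors(S)
    # Log-potentials: negative squared error scaled by 1/T.
    phi = [[-(w[t] - codebook[s]) ** 2 / T for s in range(S)] for t in range(n)]
    # Forward pass: alpha[t][s] = log-weight of all path prefixes ending in s.
    alpha = [phi[0][:]]
    for t in range(1, n):
        incoming = [[] for _ in range(S)]
        for s in range(S):
            for s2 in succ[s]:
                incoming[s2].append(alpha[-1][s])
        alpha.append([phi[t][s2] + logsumexp(incoming[s2]) for s2 in range(S)])
    # Backward pass: beta[t][s] = log-weight of all path suffixes leaving s.
    beta = [[0.0] * S for _ in range(n)]
    for t in range(n - 2, -1, -1):
        for s in range(S):
            beta[t][s] = logsumexp(
                [phi[t + 1][s2] + beta[t + 1][s2] for s2 in succ[s]]
            )
    # Posterior state marginals give the expected codebook value per position.
    out = []
    for t in range(n):
        logp = [alpha[t][s] + beta[t][s] for s in range(S)]
        log_z = logsumexp(logp)
        out.append(sum(math.exp(lp - log_z) * codebook[s]
                       for s, lp in enumerate(logp)))
    return out

def hard_codeword(w, codebook):
    """Viterbi (min-sum) decoding on the same toy trellis, for comparison."""
    S, n = len(codebook), len(w)
    succ = successors(S)
    cost = [(w[0] - codebook[s]) ** 2 for s in range(S)]
    back = []
    for t in range(1, n):
        new, ptr = [math.inf] * S, [0] * S
        for s in range(S):
            for s2 in succ[s]:
                c = cost[s] + (w[t] - codebook[s2]) ** 2
                if c < new[s2]:
                    new[s2], ptr[s2] = c, s
        cost = new
        back.append(ptr)
    s = min(range(S), key=cost.__getitem__)
    path = [s]
    for ptr in reversed(back):
        s = ptr[s]
        path.append(s)
    path.reverse()
    return [codebook[s] for s in path]

if __name__ == "__main__":
    w = [0.9, -0.2, 0.4, -1.1]            # toy weights (made up)
    codebook = [-1.0, -0.3, 0.3, 1.0]     # toy per-state values (made up)
    print(hard_codeword(w, codebook))
    for T in (1.0, 0.1, 0.01):
        print(T, soft_codeword(w, codebook, T))
```

In this toy setting the soft output approaches the Viterbi output as $T$ shrinks, provided the minimum-cost path is unique, and approaches the codebook mean as $T$ grows. Because the pass uses only sums, products, and exponentials, it is differentiable in `w`, which is the property the relaxation relies on.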

Venugopalan Iyengar • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Language Modeling | WikiText-2 | Perplexity (PPL): 10.41 | 2320 |
| Question Answering | ARC Challenge | Accuracy (ARC): 39.16 | 598 |
| Commonsense Reasoning | HellaSwag | HellaSwag Score: 70.82 | 53 |
| Physical Reasoning | PIQA | PIQA Normalized Performance: 76.93 | 12 |
| Language Modeling | C4 300K-token sample | Perplexity: 14.8 | 4 |
