
SpinQuant: LLM quantization with learned rotations

About

Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce the memory usage, latency, and power consumption of Large Language Models (LLMs), but may lead to large quantization errors when outliers are present. Rotating activation or weight matrices helps remove outliers and benefits quantization. In this work, we identify a collection of applicable rotation parameterizations that lead to identical outputs in full-precision Transformer architectures while enhancing quantization accuracy. In addition, we find that some random rotations lead to much better quantization than others, with up to a 13-point difference in downstream zero-shot reasoning performance. As a result, we propose SpinQuant, a novel approach that incorporates learned rotation matrices for optimal quantized network accuracy. With 4-bit quantization of weights, activations, and the KV cache, SpinQuant narrows the zero-shot reasoning accuracy gap to full precision to merely 2.9 points on the LLaMA-2 7B model, surpassing LLM-QAT by 19.1 points and SmoothQuant by 25.0 points. Furthermore, SpinQuant also outperforms the concurrent work QuaRot, which applies random rotations to remove outliers. In particular, for LLaMA-3 8B models that are hard to quantize, SpinQuant reduces the gap to full precision by up to 45.1% relative to QuaRot. Code is available at https://github.com/facebookresearch/SpinQuant.
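The core rotation-equivalence idea can be illustrated with a minimal NumPy sketch (an assumption-laden toy, not the authors' code): inserting an orthogonal matrix R between a weight matrix and its input leaves the full-precision output unchanged, because (W R)(Rᵀ x) = W x, while rotating an outlier-heavy activation spreads the outlier's energy across all channels, which lowers the dynamic range that a symmetric per-tensor quantizer must cover.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Toy activation with one strong outlier channel, as is common in LLMs.
x = rng.normal(size=d)
x[3] = 50.0

# Random orthogonal "rotation" via QR of a Gaussian matrix
# (SpinQuant *learns* these rotations; random is the baseline).
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

def quantize(v, bits=4):
    """Symmetric per-tensor quantization: round to a uniform grid."""
    scale = np.abs(v).max() / (2 ** (bits - 1) - 1)
    return np.round(v / scale) * scale

# 1) Rotation invariance: folding Q into the weights changes nothing
#    in full precision.
W = rng.normal(size=(d, d))
y = W @ x
y_rot = (W @ Q) @ (Q.T @ x)
assert np.allclose(y, y_rot)

# 2) Outlier removal: the rotated activation has a much smaller peak
#    magnitude, so the 4-bit grid wastes less range on one channel.
print("max |x|        :", np.abs(x).max())
print("max |Q^T x|    :", np.abs(Q.T @ x).max())
print("quant err plain:", np.linalg.norm(quantize(x) - x))
print("quant err rot  :", np.linalg.norm(Q @ quantize(Q.T @ x) - x))
```

In a real model, Q is folded offline into adjacent weight matrices so no extra multiplications are needed at inference time; SpinQuant's contribution is optimizing these rotations rather than sampling them at random.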

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, Tijmen Blankevoort • 2024

Related benchmarks

Task                               Dataset            Metric            Result  Rank
Language Modeling                  WikiText2          Perplexity        3.43    2839
Language Modeling                  WikiText-2 (test)  PPL               5.21    1949
Commonsense Reasoning              HellaSwag          Accuracy          79.94   1891
Language Modeling                  WikiText-2         Perplexity (PPL)  3.7     1624
Language Modeling                  C4                 Perplexity        6.07    1422
Language Modeling                  C4                 Perplexity        11.95   1071
Mathematical Reasoning             GSM8K (test)       Accuracy          66.1    900
Multi-task Language Understanding  MMLU               Accuracy          70.14   876
Language Modeling                  WikiText           PPL               3.73    732
Language Modeling                  C4 (val)           PPL               6.86    514
Showing 10 of 52 rows
