
Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

About

The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To use GPU resources efficiently and boost throughput, batching multiple requests has emerged as a popular paradigm; to speed up batching further, LLM quantization techniques reduce memory consumption and increase computing capacity. However, prevalent quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully leverage the capabilities of modern GPUs, such as 4-bit integer operators, resulting in sub-optimal performance. To maximize LLMs' serving throughput, we introduce Atom, a low-bit quantization method that achieves high throughput improvements with negligible accuracy loss. Atom significantly boosts serving throughput by using low-bit operators and considerably reduces memory consumption via low-bit quantization. It attains high accuracy by applying a novel mixed-precision and fine-grained quantization process. We evaluate Atom on 4-bit weight-activation quantization in the serving context. Atom improves end-to-end throughput (token/s) by up to $7.7\times$ compared to FP16 and by $2.5\times$ compared to INT8 quantization, while maintaining the same latency target.
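To make the fine-grained quantization idea concrete, below is a minimal sketch of group-wise asymmetric 4-bit quantization, where each group of contiguous values gets its own scale and zero point. The function names and group size are illustrative assumptions for exposition, not Atom's actual implementation (which additionally uses mixed precision to handle outlier channels).

```python
import numpy as np

def quantize_int4_groupwise(x, group_size=128):
    """Asymmetric 4-bit quantization applied per group of `group_size`
    contiguous values; each group gets its own scale and zero point.
    (Illustrative sketch, not Atom's actual kernel.)"""
    x = x.reshape(-1, group_size)
    x_min = x.min(axis=1, keepdims=True)
    x_max = x.max(axis=1, keepdims=True)
    scale = (x_max - x_min) / 15.0            # 4 bits -> 16 levels (0..15)
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant groups
    zero = np.round(-x_min / scale)
    q = np.clip(np.round(x / scale + zero), 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize_int4_groupwise(q, scale, zero):
    """Map 4-bit codes back to floating point."""
    return (q.astype(np.float32) - zero) * scale

# Example: quantize a synthetic weight vector and inspect the error,
# which stays within half a quantization step per group.
rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
q, s, z = quantize_int4_groupwise(w, group_size=128)
w_hat = dequantize_int4_groupwise(q, s, z).reshape(-1)
print(np.abs(w - w_hat).max())
```

Smaller groups tighten the per-group range and thus the quantization error, at the cost of storing more scales and zero points; this is the accuracy/overhead trade-off that fine-grained schemes navigate.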

Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci · 2023

Related benchmarks

Task                               | Dataset            | Result                      | Rank
Language Modeling                  | WikiText2          | Perplexity 5.14             | 2839
Language Modeling                  | WikiText-2 (test)  | PPL 3.57                    | 1949
Language Modeling                  | WikiText-2         | --                          | 1624
Language Modeling                  | C4                 | Perplexity 5                | 1422
Language Modeling                  | PTB                | Perplexity 22.16            | 1034
Multi-task Language Understanding  | MMLU               | --                          | 876
Language Modeling                  | C4 (val)           | PPL 7.03                    | 514
Multi-task Language Understanding  | MMLU               | Accuracy 79.54              | 321
Language Understanding             | MMLU (test)        | MMLU Average Accuracy 25.1  | 163
Language Understanding             | MMLU 5-shot        | Accuracy 45.01              | 132
Showing 10 of 20 rows
