Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers
About
With the increasing complexity of generative AI models, post-training quantization (PTQ) has emerged as a promising solution for deploying hyper-scale models on edge devices such as mobile phones and TVs. Existing PTQ schemes, however, consume considerable time and resources, which can become a bottleneck in real-world settings that require frequent model updates and repeated hyperparameter tuning. As a cost-effective alternative, learning-free PTQ schemes have been proposed. Their performance is limited, however, because they cannot account for the inter-layer dependency within the attention module, a defining feature of Transformers. In this paper, we therefore propose a novel PTQ algorithm that balances accuracy and efficiency. The key idea of the proposed algorithm, called aespa, is to perform quantization layer-wise for efficiency while targeting attention-wise reconstruction to capture the cross-layer dependency. Through extensive experiments on various language models and a complexity analysis, we demonstrate that aespa is both accurate and efficient in quantizing Transformer models.
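To make the distinction concrete, the toy numpy sketch below contrasts a purely layer-wise reconstruction objective (each projection's output error measured in isolation) with an attention-wise objective (error of the full attention output, which couples the query/key/value projections). This is an illustrative sketch, not the paper's implementation: round-to-nearest quantization, single-head attention, and all tensor shapes here are simplifying assumptions.

```python
import numpy as np

def rtn_quantize(w, bits=4):
    # Symmetric round-to-nearest (RTN) quantization of a weight matrix.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, w_q, w_k, w_v):
    # Single-head self-attention output for calibration input x.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = softmax(q @ k.T / np.sqrt(w_q.shape[1]))
    return scores @ v

rng = np.random.default_rng(0)
d = 16
x = rng.standard_normal((8, d))  # toy calibration tokens
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
wq_q, wk_q, wv_q = (rtn_quantize(w) for w in (w_q, w_k, w_v))

# Layer-wise objective: each projection's output error, measured independently.
layer_err = sum(np.linalg.norm(x @ w - x @ w_hat)
                for w, w_hat in [(w_q, wq_q), (w_k, wk_q), (w_v, wv_q)])

# Attention-wise objective: error of the attention output, which depends
# jointly on all three quantized projections.
attn_err = np.linalg.norm(attention(x, w_q, w_k, w_v)
                          - attention(x, wq_q, wk_q, wv_q))
```

aespa's reported efficiency comes from still quantizing one layer at a time while minimizing an objective of the second kind, so the cross-layer interaction inside the attention module is reflected in each layer's quantization target.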
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText2 | Perplexity | 4.254 | 1875 |
| Language Modeling | WikiText-2 (test) | PPL | 4.254 | 1541 |
| Language Modeling | C4 | Perplexity | 6.256 | 1182 |
| Language Modeling | PTB | Perplexity | 8.283 | 650 |
| Language Modeling | PTB (test) | Perplexity | 8.283 | 471 |
| Language Modeling | C4 (test) | Perplexity | 6.256 | 268 |
| Question Answering | Evaluation Suite (ARC, HellaSwag, MMLU) Zero-shot (test) | ARC-C | 50.34 | 67 |
| Quantization | OPT | Processing Time (s) | 74.4 | 46 |
| Quantization | LLAMA | Processing Time (hr) | 6.84 | 30 |
| Quantization | OPT v1 (train) | Processing Time (min) | 1.24 | 23 |