Training-Free Activation Sparsity in Large Language Models
About
Activation sparsity can enable practical inference speedups in large language models (LLMs) by reducing the compute and memory movement required for matrix multiplications during the forward pass. However, existing methods face limitations that inhibit widespread adoption. Some approaches are tailored towards older models with ReLU-based sparsity, while others require extensive continued pre-training on up to hundreds of billions of tokens. This paper describes TEAL, a simple training-free method that applies magnitude-based activation sparsity to hidden states throughout the entire model. TEAL achieves 40-50% model-wide sparsity with minimal performance degradation across the Llama-2, Llama-3, and Mistral families, at model sizes from 7B to 70B. We improve existing sparse kernels and demonstrate wall-clock decoding speed-ups of up to 1.53$\times$ and 1.8$\times$ at 40% and 50% model-wide sparsity, respectively. TEAL is compatible with weight quantization, enabling further efficiency gains.
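The core idea of magnitude-based activation sparsity can be illustrated with a short sketch. Note this is a hypothetical, simplified illustration (using NumPy and a per-tensor quantile threshold), not the paper's implementation: values whose magnitude falls below the sparsity-level quantile of the hidden state are zeroed, so the subsequent matrix multiplication can skip the corresponding rows or columns.

```python
import numpy as np

def sparsify_activations(hidden: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """Zero out the lowest-magnitude fraction of activations.

    Hypothetical sketch: the threshold is the `sparsity`-quantile of
    |hidden|, so roughly `sparsity` of the entries become zero.
    """
    threshold = np.quantile(np.abs(hidden), sparsity)
    return np.where(np.abs(hidden) >= threshold, hidden, 0.0)

# Example: sparsify a mock hidden state at 50% sparsity.
rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
y = sparsify_activations(x, sparsity=0.5)
print(f"fraction zeroed: {(y == 0).mean():.2f}")
```

In a real kernel, the zeroed entries let the matmul skip loading the matching weight columns, which is where the memory-movement savings come from; the threshold would typically be calibrated offline per layer rather than recomputed per token.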
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Medical Question Answering | MedMCQA | Accuracy | 52.95 | 253 |
| Long-context Language Understanding | LongBench | -- | -- | 219 |
| Question Answering | CommonsenseQA | Accuracy | 74.77 | 143 |
| General Reasoning | MMLU | MMLU Accuracy | 76.63 | 126 |
| Long-context Understanding | LongBench | Overall Average Score | 30.54 | 115 |
| Question Answering | TruthfulQA | Accuracy | 57.08 | 73 |
| Language Modeling | Wikitext (test) | Perplexity | 5.52 | 52 |
| Code | HumanEval | HumanEval Accuracy | 46.95 | 50 |
| Commonsense Reasoning | Commonsense Reasoning | Accuracy | 74.12 | 44 |
| Multi-task Language Understanding and Reasoning | OpenCompass SIQA, GSM8K, WiC, HumanEval, MMLU, CSQA | SIQA | 66.53 | 30 |