
Training-Free Activation Sparsity in Large Language Models

About

Activation sparsity can enable practical inference speedups in large language models (LLMs) by reducing the compute and memory movement required for matrix multiplications during the forward pass. However, existing methods face limitations that inhibit widespread adoption. Some approaches are tailored to older models with ReLU-induced sparsity, while others require extensive continued pre-training on up to hundreds of billions of tokens. This paper describes TEAL, a simple training-free method that applies magnitude-based activation sparsity to hidden states throughout the entire model. TEAL achieves 40-50% model-wide sparsity with minimal performance degradation across the Llama-2, Llama-3, and Mistral families, at sizes ranging from 7B to 70B. We improve existing sparse kernels and demonstrate wall-clock decoding speed-ups of up to 1.53$\times$ and 1.8$\times$ at 40% and 50% model-wide sparsity, respectively. TEAL is compatible with weight quantization, enabling further efficiency gains.

James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun • 2024
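The core idea in the abstract — zeroing low-magnitude entries of the hidden states, with no training — can be sketched in a few lines. This is a minimal illustration, not TEAL's implementation: the function name is made up here, and it uses a simple per-tensor quantile threshold, whereas the actual method calibrates thresholds per activation distribution and relies on custom sparse kernels for the speedup.

```python
import numpy as np

def magnitude_sparsify(x: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries of an activation tensor.

    `sparsity` is the fraction of entries to drop (e.g. 0.5 for 50%).
    Illustrative sketch: the threshold is the per-tensor magnitude
    quantile, applied elementwise with no training involved.
    """
    threshold = np.quantile(np.abs(x), sparsity)
    return np.where(np.abs(x) >= threshold, x, 0.0)

# Example: sparsify a mock hidden-state vector at 50% sparsity.
rng = np.random.default_rng(0)
h = rng.standard_normal(4096)
h_sparse = magnitude_sparsify(h, 0.5)
print(np.mean(h_sparse == 0))  # roughly 0.5
```

Rows of the downstream weight matrix that correspond to zeroed activations can then be skipped, which is where the compute and memory-movement savings come from.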

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Medical Question Answering | MedMCQA | Accuracy 52.95 | 253 |
| Long-context Language Understanding | LongBench | -- | 219 |
| Question Answering | CommonsenseQA | Accuracy 74.77 | 143 |
| General Reasoning | MMLU | MMLU Accuracy 76.63 | 126 |
| Long-context Understanding | LongBench | Overall Average Score 30.54 | 115 |
| Question Answering | TruthfulQA | Accuracy 57.08 | 73 |
| Language Modeling | WikiText (test) | Perplexity 5.52 | 52 |
| Code | HumanEval | HumanEval Accuracy 46.95 | 50 |
| Commonsense Reasoning | Commonsense Reasoning | Accuracy 74.12 | 44 |
| Multi-task Language Understanding and Reasoning | OpenCompass (SIQA, GSM8K, WiC, HumanEval, MMLU, CSQA) | SIQA 66.53 | 30 |

Showing 10 of 15 rows.
