Lizard: An Efficient Linearization Framework for Large Language Models
About
We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into subquadratic architectures. Transformers face severe computational and memory bottlenecks with long sequences due to the quadratic complexity of softmax attention and the growing Key-Value (KV) cache, which makes inference memory-bound by context length. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving model quality. Unlike prior linearization methods constrained by fixed, non-adaptive structures, Lizard augments the architecture with compact, learnable modules that enable adaptive memory control and robust length generalization. Moreover, we introduce a hardware-aware algorithm that solves numerical instability in gated attention to accelerate training. Extensive experiments show that Lizard achieves near-lossless recovery of its teacher model's performance, significantly outperforming previous methods by 9.4 to 24.5 points on the 5-shot MMLU benchmark and demonstrating superior associative recall.
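To make the contrast between a growing KV cache and a fixed-size memory concrete, here is a minimal pure-Python sketch of a generic gated linear attention recurrence. It is not Lizard's actual mechanism: the scalar gate `g`, the absence of a feature map and normalization, and all function names are assumptions chosen for illustration, and the sketch does not cover the softmax approximation, the learnable modules, or the hardware-aware training algorithm described above. It only shows that a recurrent state of size `d_k x d_v` reproduces the output of the equivalent quadratic-time formulation.

```python
import math
import random


def gated_linear_attention_recurrent(q, k, v, g):
    """Recurrent form: carry a fixed-size d_k x d_v state instead of a KV cache.

    q, k: T vectors of d_k floats; v: T vectors of d_v floats; g: T gates in (0, 1).
    The scalar gate is an illustrative assumption, not the paper's parameterization.
    """
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # S_0 = 0
    outputs = []
    for q_t, k_t, v_t, g_t in zip(q, k, v, g):
        # S_t = g_t * S_{t-1} + k_t v_t^T  (the gate decays older memory)
        for a in range(d_k):
            for b in range(d_v):
                state[a][b] = g_t * state[a][b] + k_t[a] * v_t[b]
        # o_t = q_t^T S_t
        outputs.append(
            [sum(q_t[a] * state[a][b] for a in range(d_k)) for b in range(d_v)]
        )
    return outputs


def gated_linear_attention_quadratic(q, k, v, g):
    """Reference O(T^2) form: o_t = sum_{i<=t} (prod_{j=i+1..t} g_j) (q_t . k_i) v_i."""
    T, d_v = len(q), len(v[0])
    outputs = []
    for t in range(T):
        o_t = [0.0] * d_v
        for i in range(t + 1):
            decay = math.prod(g[i + 1 : t + 1])  # empty product is 1 when i == t
            score = decay * sum(a * b for a, b in zip(q[t], k[i]))
            for b in range(d_v):
                o_t[b] += score * v[i][b]
        outputs.append(o_t)
    return outputs


def random_inputs(T, d_k, d_v, seed=0):
    """Hypothetical helper that builds small random inputs for a quick check."""
    rng = random.Random(seed)
    vec = lambda n: [rng.uniform(-1.0, 1.0) for _ in range(n)]
    q = [vec(d_k) for _ in range(T)]
    k = [vec(d_k) for _ in range(T)]
    v = [vec(d_v) for _ in range(T)]
    g = [rng.uniform(0.5, 1.0) for _ in range(T)]
    return q, k, v, g
```

Calling both functions on the output of `random_inputs(6, 3, 2)` should give matching results up to floating-point error. The recurrent version touches each token once and keeps memory independent of sequence length, which is the property that linearization methods aim for.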
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Commonsense Reasoning | WinoGrande | -- | 1442 |
| Multi-task Language Understanding | MMLU | Accuracy 65.1 | 881 |
| Question Answering | ARC-E | Accuracy 83.1 | 523 |
| Question Answering | ARC-C | -- | 116 |
| Common Sense Reasoning | PIQA | Accuracy 82.2 | 100 |
| Commonsense Reasoning | PIQA 1.0 (test) | Accuracy 82 | 64 |
| Common Sense Reasoning | HellaSwag | Accuracy (acc_n) 73.6 | 47 |
| Commonsense Reasoning | WinoGrande 1.0 (test) | Accuracy 72 | 31 |
| General Language Understanding | Overall LLM Evaluation Suite PiQA, ARC, HellaSwag, WinoGrande, MMLU v1 | Overall Accuracy 74.6 | 16 |
| Question Answering | ARC Easy v1 (test) | Accuracy 83.5 | 16 |