Llamba: Scaling Distilled Recurrent Models for Efficient Language Processing
About
We introduce Llamba, a family of efficient recurrent language models distilled from Llama-3.x into the Mamba architecture. The series includes Llamba-1B, Llamba-3B, and Llamba-8B, which achieve higher inference throughput and handle significantly larger batch sizes than Transformer-based models while maintaining comparable benchmark performance. Furthermore, Llamba demonstrates the effectiveness of cross-architecture distillation using MOHAWK (Bick et al., 2024), achieving these results with less than 0.1% of the training data typically used for models of similar size. To take full advantage of their efficiency, we provide an optimized implementation of Llamba for resource-constrained devices such as smartphones and edge platforms, offering a practical and memory-efficient alternative to Transformers. Overall, Llamba improves the tradeoff between speed, memory efficiency, and performance, making high-quality language models more accessible.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Math Reasoning | GSM8K | Accuracy (GSM8K) | 47.8 | 131 |
| Commonsense Reasoning | Commonsense Reasoning Suite (BoolQ, PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c) | BoolQ Accuracy | 68.62 | 43 |
| Long-context Reasoning | RULER | RULER Performance (8K Context) | 3.5 | 35 |
| Common Sense Reasoning | Common Sense Reasoning (ARC, ARC-Easy, HellaSwag, OpenBookQA, PIQA, RACE, WinoGrande) | ARC Accuracy | 45.7 | 13 |
| Common Sense Reasoning | Common Sense Reasoning (ARC, ARE, HS, OB, PI, RA, WG) | ARC Score | 37.1 | 12 |