When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models
About
Large Language Models (LLMs) are known for their performance, but we uncover a significant structural inefficiency: a phenomenon we term attention collapse. In many pre-trained decoder-style LLMs, the attention matrices in deeper layers degenerate, collapsing to near rank-one structures. These underutilized layers, which we call lazy layers, are redundant and impair model efficiency. To address this, we introduce Inheritune, a simple yet powerful training recipe designed to build smaller, stronger language models. Inheritune initializes a compact model by inheriting the potent early layers from a larger pre-trained model and then progressively trains and expands it. Our experiments on various models, including the GPT-2 family, demonstrate that models trained with Inheritune can match or even surpass the performance of their larger counterparts, despite having significantly fewer layers. This work presents a novel path toward model compression by design, enabling the creation of compact, yet highly performant language models. Code is available at https://github.com/sanyalsunny111/LLM-Inheritune.
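The core of the Inheritune recipe described above, initializing a compact model by inheriting the early layers of a larger pre-trained model, can be sketched as follows. This is a minimal illustration on toy modules, not the authors' implementation: the model structure (`embed`/`blocks`/`head`) and the `inherit` helper are hypothetical stand-ins for a real decoder-style LM such as GPT-2.

```python
import torch
import torch.nn as nn

def make_model(n_layer: int, d_model: int = 64) -> nn.ModuleDict:
    # Toy stand-in for a decoder-style LM: embedding + stack of blocks + head.
    # (Hypothetical structure; a real run would load e.g. GPT-2 checkpoints.)
    return nn.ModuleDict({
        "embed": nn.Embedding(100, d_model),
        "blocks": nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
             for _ in range(n_layer)]
        ),
        "head": nn.Linear(d_model, 100),
    })

def inherit(large: nn.ModuleDict, n_keep: int) -> nn.ModuleDict:
    # Inheritune-style init: the smaller model inherits the embeddings,
    # the first n_keep (early, well-utilized) blocks, and the output head
    # from the larger pre-trained model; training then continues from here.
    small = make_model(n_keep)
    small["embed"].load_state_dict(large["embed"].state_dict())
    small["head"].load_state_dict(large["head"].state_dict())
    for i in range(n_keep):
        small["blocks"][i].load_state_dict(large["blocks"][i].state_dict())
    return small

large = make_model(n_layer=12)   # stands in for the larger pre-trained model
small = inherit(large, n_keep=6) # keep only the potent early half

# The inherited blocks match the large model's early layers exactly.
assert torch.equal(large["blocks"][0].linear1.weight,
                   small["blocks"][0].linear1.weight)
print(len(small["blocks"]))  # 6
```

After this initialization, the compact model is trained further (and optionally grown), which is where the paper's reported gains over same-size models trained from scratch come from.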
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Language Modeling | OpenWebText (val) | Validation Loss: 2.64 | 70 |
| Downstream Language Understanding | Open LLM Leaderboard (zero-shot) | ARC-E: 52.9 | 6 |
| Zero-shot Reasoning and Question Answering | Accuracy-based tasks (ARC-E, PIQA, SciQ, HellaSwag, LAMBADA, WinoGrande, BoolQ), zero-shot | ARC-E: 51.22 | 2 |
| Zero-shot Language Modeling | Perplexity-based tasks (Wikitext, LAMBADA), zero-shot | Wikitext Perplexity: 25.52 | 2 |