
When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models

About

Large Language Models (LLMs) are known for their performance, but we uncover a significant structural inefficiency: a phenomenon we term attention collapse. In many pre-trained decoder-style LLMs, the attention matrices in deeper layers degenerate, collapsing to near rank-one structures. These underutilized layers, which we call lazy layers, are redundant and impair model efficiency. To address this, we introduce Inheritune, a simple yet powerful training recipe designed to build smaller, stronger language models. Inheritune initializes a compact model by inheriting the potent early layers from a larger pre-trained model and then progressively trains and expands it. Our experiments on various models, including the GPT-2 family, demonstrate that models trained with Inheritune can match or even surpass the performance of their larger counterparts, despite having significantly fewer layers. This work presents a novel path toward model compression by design, enabling the creation of compact, yet highly performant language models. Code is available at https://github.com/sanyalsunny111/LLM-Inheritune.
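To make the "attention collapse" idea concrete, here is a minimal, hypothetical diagnostic for it: measure the effective rank of an attention matrix and flag layers whose matrices are near rank one. The 1% singular-value threshold and this particular diagnostic are illustrative assumptions, not necessarily the exact metric used in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def effective_rank(A, tol=0.01):
    """Count singular values above tol * largest singular value.

    tol=0.01 is an illustrative threshold: a matrix whose effective
    rank is 1 under this test is treated as "collapsed".
    """
    s = np.linalg.svd(A, compute_uv=False)
    return int((s > tol * s[0]).sum())

rng = np.random.default_rng(0)
T, d = 16, 8  # sequence length, head dimension

# A "healthy" attention matrix: softmax of random query-key scores,
# so each query attends to a different mix of keys.
Q, K = rng.normal(size=(T, d)), rng.normal(size=(T, d))
healthy = softmax(Q @ K.T / np.sqrt(d))

# A "collapsed" (lazy-layer) attention matrix: every query attends to
# the same distribution over keys, so the rows are identical and the
# matrix is rank one.
row = softmax(rng.normal(size=(1, T)) * 5.0)
collapsed = np.repeat(row, T, axis=0)

print(effective_rank(healthy), effective_rank(collapsed))
```

Under this diagnostic, the healthy matrix has effective rank well above one while the collapsed matrix has effective rank exactly one, which is the degenerate structure the paper attributes to deeper, "lazy" layers.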

Sunny Sanyal, Ravid Shwartz-Ziv, Alexandros G. Dimakis, Sujay Sanghavi • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Language Modeling | OpenWebText (val) | Validation Loss | 2.64 | 70 |
| Downstream Language Understanding | Open LLM Leaderboard (zero-shot) | ARC-E | 52.9 | 6 |
| Zero-shot downstream reasoning and question answering | Accuracy-based tasks (ARC-E, PIQA, SciQ, HellaSwag, LAMBADA, WinoGrande, BoolQ), zero-shot | ARC-E | 51.22 | 2 |
| Zero-shot Language Modeling | Perplexity-based tasks (Wikitext, LAMBADA), zero-shot | Wikitext Perplexity | 25.52 | 2 |
