When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models
About
Large Language Models (LLMs) are known for their performance, but we uncover a significant structural inefficiency: a phenomenon we term attention collapse. In many pre-trained decoder-style LLMs, the attention matrices in deeper layers degenerate, collapsing to near rank-one structures. These underutilized layers, which we call lazy layers, are redundant and impair model efficiency. To address this, we introduce Inheritune, a simple yet powerful training recipe designed to build smaller, stronger language models. Inheritune initializes a compact model by inheriting the potent early layers from a larger pre-trained model and then progressively trains and expands it. Our experiments on various models, including the GPT-2 family, demonstrate that models trained with Inheritune can match or even surpass the performance of their larger counterparts, despite having significantly fewer layers. This work presents a novel path toward model compression by design, enabling the creation of compact, yet highly performant language models. Code is available at https://github.com/sanyalsunny111/LLM-Inheritune.
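The core of the Inheritune recipe described above, initializing a compact model by inheriting the early layers of a larger pre-trained model, can be sketched as follows. This is a minimal illustration on toy modules, not the authors' implementation: the model structure (`embed`/`blocks`/`head`) and the `inherit` helper are hypothetical stand-ins for a real decoder-style LM such as GPT-2.

```python
import torch
import torch.nn as nn

def make_model(n_layer: int, d_model: int = 64) -> nn.ModuleDict:
    # Toy stand-in for a decoder-style LM: embedding + stack of blocks + head.
    # (Hypothetical structure; a real run would load e.g. GPT-2 checkpoints.)
    return nn.ModuleDict({
        "embed": nn.Embedding(100, d_model),
        "blocks": nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
             for _ in range(n_layer)]
        ),
        "head": nn.Linear(d_model, 100),
    })

def inherit(large: nn.ModuleDict, n_keep: int) -> nn.ModuleDict:
    # Inheritune-style init: the smaller model inherits the embeddings,
    # the first n_keep (early, well-utilized) blocks, and the output head
    # from the larger pre-trained model; training then continues from here.
    small = make_model(n_keep)
    small["embed"].load_state_dict(large["embed"].state_dict())
    small["head"].load_state_dict(large["head"].state_dict())
    for i in range(n_keep):
        small["blocks"][i].load_state_dict(large["blocks"][i].state_dict())
    return small

large = make_model(n_layer=12)   # stands in for the larger pre-trained model
small = inherit(large, n_keep=6) # keep only the potent early half

# The inherited blocks match the large model's early layers exactly.
assert torch.equal(large["blocks"][0].linear1.weight,
                   small["blocks"][0].linear1.weight)
print(len(small["blocks"]))  # 6
```

After this initialization, the compact model is trained further (and optionally grown), which is where the paper's reported gains over same-size models trained from scratch come from.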
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Language Modeling | OpenWebText (val) | Validation Loss: 2.64 | 70 |
| Downstream Language Understanding | Open LLM Leaderboard (zero-shot) | ARC-E: 52.9 | 6 |
| Zero-shot Reasoning and Question Answering | Accuracy-based tasks (ARC-E, PIQA, SciQ, HellaSwag, LAMBADA, WinoGrande, BoolQ), zero-shot | ARC-E: 51.22 | 2 |
| Zero-shot Language Modeling | Perplexity-based tasks (Wikitext, LAMBADA), zero-shot | Wikitext Perplexity: 25.52 | 2 |