
The Curse of Depth in Large Language Models

About

In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern Large Language Models (LLMs) that nearly half of the layers are less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs, such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, both theoretical and empirical, identifies the widespread use of Pre-Layer Normalization (Pre-LN) as the underlying reason for the ineffectiveness of deep layers in LLMs. While Pre-LN stabilizes the training of Transformer LLMs, its output variance grows exponentially with model depth, which undesirably drives the derivative of the deep Transformer blocks toward an identity matrix, so those blocks barely contribute to training. To resolve this training pitfall, we propose LayerNorm Scaling (LNS), which scales the variance of the output of the layer normalization inversely by the square root of its depth. This simple modification mitigates the output variance explosion of deeper Transformer layers, improving their contribution. Across a wide range of model sizes (130M to 7B), our experiments show that LNS consistently outperforms previous normalization and scaling techniques in enhancing LLM pre-training performance. Moreover, this improvement carries over seamlessly to supervised fine-tuning. All these gains can be attributed to the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training. Our code is available at https://github.com/lmsdss/LayerNorm-Scaling.
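The core idea above, scaling each layer's normalization output by the inverse square root of its depth, can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation; the function names and the gain/bias-free LayerNorm are simplifying assumptions.

```python
import math
import numpy as np

def layernorm(x, eps=1e-5):
    # Standard LayerNorm over the last axis (learnable gain/bias omitted).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def layernorm_scaling(x, layer_depth):
    # LNS: multiply the LayerNorm output by 1/sqrt(layer_depth), so deeper
    # blocks (larger layer_depth) get a smaller scale, counteracting the
    # exponential growth of output variance under Pre-LN.
    return layernorm(x) * (1.0 / math.sqrt(layer_depth))

x = np.random.randn(4, 16)
y1 = layernorm_scaling(x, layer_depth=1)  # scale = 1, identical to plain LayerNorm
y9 = layernorm_scaling(x, layer_depth=9)  # scale = 1/3
```

In a Pre-LN Transformer, this scaled normalization would simply replace the plain LayerNorm in front of each attention and feed-forward sub-block, with `layer_depth` set to the block's 1-based index.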

Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, Shiwei Liu • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Language Modeling | LAMBADA | Accuracy 48.94 | 183 |
| Common Sense Reasoning | HellaSwag | Accuracy 45.44 | 164 |
| Language Modeling | Pre-training corpus (train) | Perplexity 15.71 | 20 |
| Synthetic Reasoning | Reasoning Primitives | Accuracy 50.58 | 16 |
| Mathematical Reasoning | Math Word | Accuracy 17.84 | 16 |
| Question Answering and Reasoning | Downstream Reasoning Suite (ARC-e, PIQA, HellaSwag, OpenBookQA, Winogrande, MMLU, BoolQ) | ARC-e 34.49 | 14 |
| Language Understanding and Reasoning | MMLU, BoolQ, ARC-e, PIQA, HellaSwag, OBQA, Winogrande (test) | MMLU 28.69 | 10 |
| Language Modeling | Pretraining Dataset | Train Loss (PT) 3.16 | 10 |
| Question Answering | Q&A Closed-book | F1 Score 18.63 | 8 |
| Language Modeling | Holdout Set | NLL 1.97 | 8 |

Showing 10 of 16 rows.
