
Post-LayerNorm Is Back: Stable, Expressive, and Deep

About

Large language model (LLM) scaling is hitting a wall. Widening models yields diminishing returns, and extending context length does not improve fundamental expressivity. In contrast, depth scaling offers theoretically superior expressivity, yet current Transformer architectures struggle to train reliably at extreme depths. We revisit the Post-LayerNorm (Post-LN) formulation, whose instability at scale caused its replacement by Pre-LN in modern LLMs. We show that the central failure mode of Post-LN arises from the ResNet-style residual pathway, which introduces gradient vanishing in deep networks. We present Keel, a Post-LN Transformer that replaces this residual path with a Highway-style connection. This modification preserves the gradient flow through the residual branch, preventing signal vanishing from the top layers to the bottom. Unlike prior methods, Keel enables stable training at extreme depths without requiring specialized initialization or complex optimization tricks. Keel trains robustly at depths exceeding 1000 layers and consistently improves perplexity and depth-scaling characteristics over Pre-LN. These findings indicate that Post-LN, when paired with a Highway-style connection, provides a simple and effective foundation for building deeply scalable LLMs, opening the possibility for future infinite-depth architectures.
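The abstract does not spell out Keel's exact update rule, but the contrast it draws between a ResNet-style residual (where LayerNorm sits on the combined signal and attenuates the identity path in deep stacks) and a Highway-style connection (where a learned gate preserves a direct route for gradients) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the sigmoid gate parameters `w_gate` and `b_gate` are assumptions for the sketch.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_resnet(x, f):
    # Classic Post-LN with a ResNet-style residual: LN is applied AFTER the
    # add, so the identity path also passes through the normalization.
    return layer_norm(x + f(x))

def post_ln_highway(x, f, w_gate, b_gate):
    # Hypothetical Highway-style Post-LN sketch: a learned gate t blends the
    # normalized transform with an UNnormalized identity path, leaving a
    # direct route for gradients from top layers to bottom layers.
    t = 1.0 / (1.0 + np.exp(-(x @ w_gate + b_gate)))  # sigmoid carry gate
    return t * layer_norm(f(x)) + (1.0 - t) * x
```

In the ResNet-style form, the identity branch is rescaled by LayerNorm at every layer, which is the vanishing mechanism the abstract attributes to Post-LN; in the Highway-style form, the `(1 - t) * x` term bypasses normalization entirely.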

Chen Chen, Lai Wei • 2026

Related benchmarks

Task                   Dataset           Metric       Result  Rank
Commonsense Reasoning  HellaSwag         --           --      1891
Code Generation        HumanEval         --           --      1036
Commonsense Reasoning  PIQA              Accuracy     77.5    751
Reasoning              BBH               --           --      672
Commonsense Reasoning  ARC Challenge     Accuracy     53.6    190
Language Modeling      LAMBADA           --           --      150
Reasoning              MMLU-Pro          Accuracy     35.6    95
Commonsense Reasoning  ARC Easy          --           --      72
Code Generation        MBPP              MBPP Score   42.6    35
General Reasoning      AGI Eval English  Score        46.5    32

Showing 10 of 18 rows
