
Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis

About

Gradually growing the depth of Transformers during training can not only reduce training cost but also improve reasoning performance, as shown by MIDAS (Saunshi et al., 2024). Thus far, however, a mechanistic understanding of these gains has been missing. In this work, we establish a connection to recent work showing that layers in the second half of non-grown, pre-layernorm Transformers contribute much less to the final output distribution than those in the first half, a phenomenon known as the Curse of Depth (Sun et al., 2025; Csordás et al., 2025). Using depth-wise analyses, we demonstrate that growth via gradual middle stacking yields more effective utilization of model depth, alters the residual stream structure, and facilitates the formation of permutable computational blocks. In addition, we propose a lightweight modification of MIDAS that yields further improvements on downstream reasoning benchmarks. Overall, this work highlights how gradual growth of model depth can lead to the formation of distinct computational circuits and overcome the limited depth utilization seen in standard non-grown models.
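To make the growth mechanism concrete, here is a minimal PyTorch sketch of one plausible reading of gradual middle stacking: at each growth stage, a copy of a central block of layers is inserted at the midpoint of the stack, and training resumes from the enlarged model. The function `grow_middle` and the schedule mentioned below are illustrative assumptions, not the MIDAS implementation.

```python
import copy
import torch.nn as nn

def grow_middle(layers: nn.ModuleList, num_new: int) -> nn.ModuleList:
    """Grow a Transformer stack by duplicating a central block of layers.

    Illustrative reading of gradual middle stacking: copies of the
    `num_new` layers centered on the middle of the stack are inserted
    at the midpoint, so the grown model inherits the smaller model's
    weights and training can simply continue.
    """
    assert 0 < num_new <= len(layers)
    mid = len(layers) // 2
    start = max(0, mid - num_new // 2)
    # Deep-copy the central block so the new layers start from the
    # same weights as the layers they duplicate.
    new_block = [copy.deepcopy(layers[i]) for i in range(start, start + num_new)]
    return nn.ModuleList(list(layers[:mid]) + new_block + list(layers[mid:]))
```

A growth schedule would then call this at fixed points in training, for example growing an 8-layer model to 12 and then to 16 layers before the final stage.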
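The depth-utilization claim itself can be probed with a simple skip ablation. The sketch below is again an assumption-laden illustration rather than the paper's actual analysis: it removes one layer at a time and measures how far the next-token distribution moves; layers whose removal barely changes the output are the under-utilized ones the Curse of Depth describes. It assumes a model object that exposes its blocks as `model.layers` and returns logits when called; adapt both to the model class you actually use.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def layer_contributions(model, input_ids):
    """Estimate each layer's contribution to the output distribution.

    Skip-ablation probe: for each layer i, run the model without that
    layer and measure the KL divergence from the full model's
    next-token distribution. Small values indicate under-utilized layers.
    """
    layers = model.layers
    full = F.log_softmax(model(input_ids), dim=-1)
    scores = []
    for i in range(len(layers)):
        # Temporarily swap in a stack with layer i removed.
        model.layers = torch.nn.ModuleList(
            [l for j, l in enumerate(layers) if j != i]
        )
        ablated = F.log_softmax(model(input_ids), dim=-1)
        model.layers = layers  # restore the full stack
        kl = F.kl_div(ablated, full, log_target=True, reduction="batchmean")
        scores.append(kl.item())
    return scores
```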

Ferdinand Kapl, Emmanouil Angelis, Tobias Höppe, Kaitlin Maile, Johannes von Oswald, Nino Scherrer, Stefan Bauer • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | LAMBADA | Accuracy | 51.41 | 183 |
| Common Sense Reasoning | HellaSwag | Accuracy | 46.32 | 164 |
| Mathematical Reasoning | Math Word | Accuracy | 24.6 | 16 |
| Synthetic Reasoning | Reasoning Primitives | Accuracy | 53 | 16 |
| Question Answering | Open-book Q&A | F1 Score | 29.84 | 8 |
| Question Answering | Closed-book Q&A | F1 Score | 19.08 | 8 |
| Language Modeling | Holdout Set | NLL | 1.96 | 8 |
