
LESA: Learnable LLM Layer Scaling-Up

About

Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. However, existing depth scaling-up methods rely on empirical heuristic rules for layer duplication, which result in poorer initialization and slower convergence during continual pre-training. We propose LESA, a novel learnable method for depth scaling-up. By concatenating parameters from each layer and applying Singular Value Decomposition, we uncover latent patterns between layers, suggesting that inter-layer parameters can be learned. LESA uses a neural network to predict the parameters inserted between adjacent layers, enabling better initialization and faster training. Experiments show that LESA outperforms existing baselines, achieving superior performance with less than half the computational cost during continual pre-training. Extensive analyses demonstrate its effectiveness across different model sizes and tasks.
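The pipeline described above (concatenate per-layer parameters, apply SVD, predict the parameters of a layer inserted between two neighbours) can be illustrated with a small NumPy sketch. This is one possible reading of the abstract, not the authors' implementation: the function names, the column-wise concatenation, and the choice to predict in the SVD coefficient space are all assumptions. In particular, `mean_predictor` is a hypothetical stand-in that simply averages the two neighbours; LESA trains a neural network for this step.

```python
import numpy as np

def decompose_layers(weights):
    """Concatenate same-shaped per-layer matrices column-wise and apply SVD.

    Returns the shared factors (u, s) and one coefficient block per layer,
    so that weights[i] == (u * s) @ coeffs[i] up to floating-point error.
    """
    d_in = weights[0].shape[1]
    stacked = np.concatenate(weights, axis=1)            # (d_out, L * d_in)
    u, s, vt = np.linalg.svd(stacked, full_matrices=False)
    coeffs = [vt[:, i * d_in:(i + 1) * d_in] for i in range(len(weights))]
    return u, s, coeffs

def reconstruct(u, s, coeff):
    """Map a coefficient block back to a weight matrix."""
    return (u * s) @ coeff

def mean_predictor(left, right):
    """Hypothetical placeholder: average the neighbouring coefficient blocks.

    LESA replaces this step with a trained neural network.
    """
    return 0.5 * (left + right)

def insert_between(weights, predictor):
    """Insert one predicted matrix between every pair of adjacent layers."""
    u, s, coeffs = decompose_layers(weights)
    deeper = []
    for i, w in enumerate(weights):
        deeper.append(w)
        if i + 1 < len(weights):
            new_coeff = predictor(coeffs[i], coeffs[i + 1])
            deeper.append(reconstruct(u, s, new_coeff))
    return deeper

# Toy usage: three random "layers" become five.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((4, 6)) for _ in range(3)]
deeper_layers = insert_between(layers, mean_predictor)
```

With the averaging placeholder, each inserted matrix is just the mean of its two neighbours, which is the kind of heuristic initialization the abstract argues against; the sketch only shows where a learned predictor would plug in.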

Yifei Yang, Zouying Cao, Xinbei Ma, Yao Yao, Libo Qin, Zhi Chen, Hai Zhao • 2025

Related benchmarks

Task                               Dataset              Metric      Result  Rank
Commonsense Reasoning              HellaSwag            Accuracy    32.09   1896
Code Generation                    HumanEval            Pass@1      25      1043
Multi-task Language Understanding  MMLU                 Accuracy    36.43   881
Language Modeling                  WikiText-103 (test)  Perplexity  7.72    703
Question Answering                 ARC-E                Accuracy    42.86   523
Commonsense Reasoning              WinoGrande           Accuracy    60.38   453
Boolean Question Answering         BoolQ                Accuracy    66.33   350
Question Answering                 BoolQ                Accuracy    70.46   317
Question Answering                 ARC-C                Accuracy    32.54   258
Question Answering                 TriviaQA             Accuracy    67.15   238
Showing 10 of 20 rows
