
Scaling and Transferability of Annealing Strategies in Large Language Model Training

About

Learning rate scheduling is crucial for training large language models, yet identifying optimal annealing strategies across different model configurations remains challenging. In this work, we investigate the transferability of annealing dynamics in large language model training and refine a generalized predictive framework for optimizing annealing strategies under the Warmup-Steady-Decay (WSD) scheduler. The improved framework incorporates training steps, maximum learning rate, and annealing behavior, enabling more efficient optimization of learning rate schedules. Our work provides practical guidance for selecting optimal annealing strategies without exhaustive hyperparameter searches, demonstrating that smaller models can serve as reliable proxies for optimizing the training dynamics of larger models. We validate our findings through extensive experiments on both Dense and Mixture-of-Experts (MoE) models, showing that optimal annealing ratios follow consistent patterns and can be transferred across different training configurations.
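
For context, below is a minimal sketch of the Warmup-Steady-Decay (WSD) schedule the abstract refers to. The function name, parameter defaults, linear warmup, and cosine-shaped decay are illustrative assumptions, not the authors' implementation; the paper's contribution is a predictive framework for choosing the annealing ratio and behavior, which is not reproduced here.

```python
import math

def wsd_lr(step, total_steps, max_lr,
           warmup_ratio=0.01, anneal_ratio=0.1, min_lr=0.0):
    """Illustrative Warmup-Steady-Decay (WSD) schedule.

    Linear warmup to max_lr, a long steady phase held at max_lr,
    then a final annealing phase. anneal_ratio controls what
    fraction of training is spent decaying (the quantity the
    paper's framework optimizes).
    """
    warmup_steps = int(total_steps * warmup_ratio)
    anneal_steps = int(total_steps * anneal_ratio)
    steady_end = total_steps - anneal_steps

    if step < warmup_steps:
        # Linear warmup from 0 to max_lr.
        return max_lr * step / max(warmup_steps, 1)
    if step < steady_end:
        # Steady phase: hold the maximum learning rate.
        return max_lr
    # Annealing phase: cosine decay from max_lr down to min_lr
    # (the decay shape here is an assumption for illustration).
    progress = (step - steady_end) / max(anneal_steps, 1)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```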

Siqi Wang, Zhengyu Chen, Teng Xiao, Zheqi Lv, Jinluan Yang, Xunliang Cai, Jingang Wang, Xiaomeng Li • 2025

Related benchmarks

Task | Dataset | Result | Rank
Loss Curve Prediction | Dense Model Loss Curve Prediction (WSD to Cosine transfer) | MAPE 0.232 | 9
Loss Curve Prediction | Dense Model Loss Curve Prediction (Cosine to WSD transfer) | MAPE 0.41 | 9
Loss curve fitting across batch sizes | Model loss data (train) | ASMT Score 0.529 | 7
Loss curve fitting across model sizes | Dense models | ASMT (MAPE) 0.402 | 3
Loss curve fitting across model sizes | MoE models (various sizes) | ASMT (MAPE) 0.341 | 3
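
The MAPE results above measure how closely a predicted loss curve matches the observed one. A standard definition is sketched below; whether the leaderboard reports the raw ratio or a percentage is an assumption here.

```python
def mape(predicted, actual):
    """Mean Absolute Percentage Error between a predicted and an
    observed loss curve, reported as a raw ratio (multiply by 100
    for a percentage)."""
    assert len(predicted) == len(actual) and len(actual) > 0
    return sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual)) / len(actual)
```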
