
Bridging Training and Merging Through Momentum-Aware Optimization

About

Training large neural networks and merging task-specific models both exploit low-rank structure and require parameter importance estimation, yet these challenges have been pursued in isolation. Current workflows compute curvature information during training, discard it, then recompute similar information for merging, which wastes computation and throws away valuable trajectory data. We introduce a unified framework that maintains factorized momentum and curvature statistics during training, then reuses this information for geometry-aware model composition. The proposed method incurs modest memory overhead (approximately 30% over AdamW) to accumulate task saliency scores that enable curvature-aware merging. These scores, computed as a byproduct of optimization, provide importance estimates comparable to post-hoc Fisher computation while producing merge-ready models directly from training. We establish convergence guarantees for non-convex objectives with approximation error bounded by gradient singular value decay. On natural language understanding benchmarks, curvature-aware parameter selection outperforms magnitude-only baselines across all sparsity levels, with multi-task merging improving 1.6% over strong baselines. The proposed framework exhibits rank-invariant convergence and superior hyperparameter robustness compared to existing low-rank optimizers. By treating the optimization trajectory as a reusable asset rather than discarding it, our approach demonstrates that training-time curvature information suffices for effective model composition, enabling a unified training-merging pipeline.
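The abstract does not spell out the update rules, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes an Adafactor-style row/column factorization of the squared-gradient moving average as the curvature proxy, a saliency score of curvature times squared parameter change, and a saliency-weighted average for merging. The function names (`factored_second_moment`, `saliency`, `merge`) and the `beta` and `keep` parameters are hypothetical.

```python
import numpy as np

def factored_second_moment(grads, beta=0.99):
    """Moving average of row and column means of squared gradients.

    Assumption: an Adafactor-style factorization stands in for the paper's
    "factorized curvature statistics". No bias correction is applied.
    """
    rows = np.zeros(grads[0].shape[0])
    cols = np.zeros(grads[0].shape[1])
    for g in grads:
        sq = g * g
        rows = beta * rows + (1 - beta) * sq.mean(axis=1)
        cols = beta * cols + (1 - beta) * sq.mean(axis=0)
    # Rank-1 reconstruction of the full curvature proxy from its factors.
    return np.outer(rows, cols) / max(rows.mean(), 1e-12)

def saliency(delta, curvature):
    """Hypothetical per-parameter importance: curvature times squared change."""
    return curvature * delta ** 2

def merge(base, deltas, saliencies, keep=0.5, eps=1e-12):
    """Saliency-weighted merge of task deltas onto shared base weights.

    For each task, only the top `keep` fraction of entries by saliency gets a
    nonzero weight; entries that no task selects stay at the base value.
    """
    masked = []
    for s in saliencies:
        k = max(1, int(round(keep * s.size)))
        thresh = np.partition(s.ravel(), -k)[-k]  # k-th largest saliency
        masked.append(np.where(s >= thresh, s, 0.0))
    weights = np.stack(masked)
    total = weights.sum(axis=0)
    merged = (weights * np.stack(deltas)).sum(axis=0) / np.maximum(total, eps)
    return base + np.where(total > 0, merged, 0.0)
```

In this sketch the saliencies come for free from statistics the optimizer already tracks, which mirrors the abstract's point that no separate post-hoc Fisher pass is needed. A magnitude-only baseline would correspond to replacing `curvature` with a constant.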

Alireza Moayedikia, Alicia Troncoso • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Language Modeling | FineWeb (val) | Validation Loss: 2.03 | 159 |
| Natural Language Understanding | GLUE (test val) | MRPC Accuracy: 79.4 | 59 |
| Natural Language Understanding | GLUE BERT-base experts (test) | RTE Score: 57.8 | 6 |
