Mashup Learning: Faster Finetuning by Remixing Past Checkpoints
About
Finetuning on domain-specific data is a well-established method for enhancing LLM performance on downstream tasks. Training on each dataset produces a new set of model weights, resulting in a multitude of checkpoints saved in-house or on open-source platforms. However, these training artifacts are rarely reused in subsequent experiments, despite encoding improved model abilities that may transfer to similar tasks. In this paper, we propose Mashup Learning, a simple method that leverages the outputs of prior training runs to enhance model adaptation to new tasks. Our procedure identifies the historical checkpoints most relevant to a target dataset, aggregates them with model merging, and uses the result as an improved initialization for training. Across 8 standard LLM benchmarks, four models, and two collections of source checkpoints, Mashup Learning consistently improves average downstream accuracy by 0.5-5 percentage points over training from scratch. It also accelerates convergence, requiring 41-46% fewer training steps and up to 37% less total wall-clock time to match from-scratch accuracy, including all selection and merging overhead.
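The two-stage procedure described above (select relevant checkpoints, then merge them into an initialization) can be sketched as follows. This is a minimal illustration, not the paper's exact method: the relevance score (proxy loss of each checkpoint on a small target-data sample) and the merging rule (uniform weight averaging) are assumptions chosen for simplicity, and the checkpoint format is a plain dict of NumPy arrays.

```python
import numpy as np

def select_checkpoints(checkpoints, score_fn, k=3):
    """Rank historical checkpoints by relevance to the target task
    (here: lower proxy loss on a small target sample) and keep the top k."""
    return sorted(checkpoints, key=score_fn)[:k]

def merge_checkpoints(state_dicts):
    """Uniform weight averaging, a simple model-merging baseline:
    average each parameter tensor across the selected checkpoints."""
    return {
        name: np.mean([sd[name] for sd in state_dicts], axis=0)
        for name in state_dicts[0]
    }

if __name__ == "__main__":
    # Toy checkpoints: tiny "models" with a hypothetical proxy loss
    # measured on a sample of the target dataset.
    ckpts = [
        {"id": "math-sft", "loss": 0.9, "state": {"w": np.array([1.0, 2.0])}},
        {"id": "code-sft", "loss": 2.1, "state": {"w": np.array([5.0, 6.0])}},
        {"id": "qa-sft",   "loss": 1.1, "state": {"w": np.array([3.0, 4.0])}},
    ]
    top = select_checkpoints(ckpts, score_fn=lambda c: c["loss"], k=2)
    init = merge_checkpoints([c["state"] for c in top])
    # init["w"] is the element-wise mean of the two most relevant
    # checkpoints, and would seed the finetuning run: → [2.0, 3.0]
    print(init["w"])
```

The merged weights then replace the base model's weights as the starting point for ordinary finetuning on the target dataset, which is where the reported step and wall-clock savings come from.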
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | WinoGrande | Accuracy | 82.5 | 1085 |
| Question Answering | ARC Easy | Accuracy | 87.9 | 597 |
| Question Answering | PIQA | Accuracy | 86.5 | 374 |
| Mathematical Reasoning | MathQA | Accuracy | 32.1 | 305 |
| Commonsense Reasoning | HellaSwag | Accuracy | 95.1 | 47 |
| Reasoning | ARC-e (leave-one-out setup) | Accuracy | 93.5 | 12 |
| Reasoning | CSQA (leave-one-out setup) | Accuracy | 83.8 | 12 |
| Reasoning | HellaSwag (leave-one-out setup) | Average Accuracy | 94.7 | 12 |
| Reasoning | MathQA (leave-one-out setup) | Average Accuracy | 56.9 | 12 |
| Reasoning | OBQA (leave-one-out setup) | Average Accuracy | 87.7 | 12 |