On Giant's Shoulders: Effortless Weak to Strong by Dynamic Logits Fusion
About
Efficient fine-tuning of large language models for task-specific applications is imperative, yet the vast number of parameters in these models makes their training increasingly challenging. Despite numerous proposals for effective methods, a substantial memory overhead remains for gradient computations during updates. *Can we fine-tune a series of task-specific small models and transfer their knowledge directly to a much larger model without additional training?* In this paper, we explore weak-to-strong specialization using logit arithmetic, facilitating a direct answer to this question. Existing weak-to-strong methods often employ a static knowledge-transfer ratio and a single small model for transferring complex knowledge, which leads to suboptimal performance. To surmount these limitations, we propose a dynamic logit fusion approach that works with a series of task-specific small models, each specialized in a different task. This method adaptively allocates weights among these models at each decoding step, learning the weights by solving a Kullback-Leibler-divergence-constrained optimization problem. We conduct extensive experiments across various benchmarks in both single-task and multi-task settings, achieving leading results. By transferring expertise from the 7B model to the 13B model, our method closes the performance gap by 96.4% in single-task scenarios and by 86.3% in multi-task scenarios compared to full fine-tuning of the 13B model. Notably, our method achieves superior performance on unseen tasks. Moreover, we further demonstrate that our method can effortlessly integrate with in-context learning for single tasks and with task arithmetic for multi-task scenarios.
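The core logit-arithmetic idea can be sketched as follows: at each decoding step, the strong model's logits are shifted by a weighted sum of the logit offsets (fine-tuned minus base) contributed by the small expert models. The sketch below is a minimal illustration in NumPy; the function name `fuse_logits` and the fixed weights are assumptions for demonstration only, since the paper learns the per-expert weights dynamically at every step via a KL-constrained optimization rather than fixing them.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the vocabulary dimension."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fuse_logits(strong_logits, expert_deltas, weights):
    """Fuse logits at a single decoding step.

    strong_logits: logits of the large (strong) base model, shape (V,).
    expert_deltas: list of logit offsets, each (tuned_small - base_small),
                   shape (V,) per expert.
    weights:       per-expert fusion weights. Hypothetical static values here;
                   the paper's method re-learns these at every decoding step.
    """
    fused = strong_its = strong_logits.copy()
    for w, delta in zip(weights, expert_deltas):
        fused = fused + w * delta
    return fused

# Toy example with a vocabulary of 5 tokens and two experts.
rng = np.random.default_rng(0)
V = 5
strong = rng.normal(size=V)
deltas = [rng.normal(size=V) for _ in range(2)]
weights = [0.7, 0.3]  # illustrative only; not learned here
probs = softmax(fuse_logits(strong, deltas, weights))
assert np.isclose(probs.sum(), 1.0)
```

The next token is then sampled (or greedily selected) from `probs`; because only logits cross the model boundary, no gradients flow into the large model, which is what makes the transfer training-free.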
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Understanding | MMLU | Accuracy | 57.15 | 756 |
| Multitask Language Understanding | MMLU | Accuracy | 51.31 | 206 |
| Question Answering | TriviaQA | EM | 57.11 | 116 |
| Mathematical Reasoning | GSM8K | EM | 34.87 | 115 |
| Question Answering | TruthfulQA | Accuracy | 61.56 | 82 |
| Summarization | CNN/DM | -- | -- | 56 |
| Text Summarization | CNN/DM | ROUGE-2 | 10.52 | 16 |
| General Language Proficiency | Aggregated GSM8K, TruthfulQA, TriviaQA, CNN/DM, MMLU | Average Score | 46.09 | 9 |
| Mathematical Reasoning | GSM8K | EM Accuracy | 39.34 | 9 |