MetaGPT: Merging Large Language Models Using Model Exclusive Task Arithmetic
About
The advent of large language models (LLMs) like GPT-4 has catalyzed the exploration of multi-task learning (MTL), in which a single model demonstrates proficiency across diverse tasks. Task arithmetic has emerged as a cost-effective approach for MTL. It enables performance enhancement across multiple tasks by adding their corresponding task vectors to a pre-trained model. However, no existing method simultaneously achieves optimal performance, computational efficiency, and data privacy, which limits the application of task arithmetic to LLMs. In this paper, we propose Model Exclusive Task Arithmetic for merging GPT-scale models (MetaGPT), which formalizes the objective of model merging into a multi-task learning framework, aiming to minimize the average loss difference between the merged model and each individual task model. Since data privacy limits the use of multi-task training data, we leverage LLMs' local linearity and task vectors' orthogonality to separate the data term from the scaling-coefficient term and derive a model-exclusive task arithmetic method. Our proposed MetaGPT is data-agnostic and bypasses the heavy search process, making it cost-effective and easy to implement for LLMs. Extensive experiments demonstrate that MetaGPT leads to improvements in task arithmetic and achieves state-of-the-art performance on multiple tasks.
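To make the idea of task arithmetic concrete, the following is a minimal sketch in plain Python. It treats a model as a dictionary of flat weight lists, builds each task vector as the difference between fine-tuned and pre-trained weights, and adds the scaled task vectors back to the pre-trained model. The names `task_vector`, `norm_based_coefficients`, and `merge` are hypothetical. The norm-proportional coefficients are an assumption chosen only because they can be computed from the weights alone, with no data or search; the abstract does not state the coefficients MetaGPT derives.

```python
from typing import Dict, List

# Toy "model": parameter name -> flat list of weights (illustrative only).
Params = Dict[str, List[float]]


def task_vector(finetuned: Params, pretrained: Params) -> Params:
    """Task vector = fine-tuned weights minus pre-trained weights."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def squared_norm(vec: Params) -> float:
    """Squared L2 norm over all parameters of a task vector."""
    return sum(x * x for values in vec.values() for x in values)


def norm_based_coefficients(task_vectors: List[Params]) -> List[float]:
    """ASSUMPTION: one data-free choice of scaling coefficients, proportional
    to each task vector's squared norm. Not confirmed by the abstract."""
    norms = [squared_norm(tv) for tv in task_vectors]
    total = sum(norms)
    return [n / total for n in norms]


def merge(pretrained: Params, task_vectors: List[Params],
          coefficients: List[float]) -> Params:
    """Task arithmetic: pretrained + sum_t coefficient_t * task_vector_t."""
    merged = {name: list(values) for name, values in pretrained.items()}
    for tv, lam in zip(task_vectors, coefficients):
        for name in merged:
            merged[name] = [m + lam * t for m, t in zip(merged[name], tv[name])]
    return merged


# Two toy fine-tuned models with orthogonal task vectors.
pretrained = {"w": [0.0, 0.0]}
finetuned_a = {"w": [3.0, 0.0]}
finetuned_b = {"w": [0.0, 4.0]}

vectors = [task_vector(finetuned_a, pretrained),
           task_vector(finetuned_b, pretrained)]
coefficients = norm_based_coefficients(vectors)
merged = merge(pretrained, vectors, coefficients)
```

For real checkpoints the same arithmetic would run over tensors rather than Python lists. The sketch uses lists only to stay dependency-free.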
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | GQA | Accuracy 56.53 | 963 |
| Multimodal Evaluation | MME | Score 67.62 | 557 |
| Visual Question Answering | TextVQA (val) | VQA Score 77.18 | 309 |
| OCR Evaluation | OCRBench | Score 33.9 | 296 |
| Visual Question Answering | OKVQA | Top-1 Accuracy 56.54 | 283 |
| Visual Question Answering | OK-VQA | Accuracy 43.02 | 224 |
| Multimodal Understanding | SEED-Bench | -- | 203 |
| Text-based Visual Question Answering | TextVQA (val) | Accuracy 55.83 | 146 |
| Visual Question Answering | GQA (test) | Accuracy 59.93 | 119 |
| Multimodal Reasoning | MMMU (val) | Accuracy 34.9 | 114 |