Dataless Knowledge Fusion by Merging Weights of Language Models
About
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models. Often, fine-tuned models are readily available but their training data is not, due to data privacy or intellectual property concerns. This creates a barrier to fusing knowledge across individual models to yield a better single model. In this paper, we study the problem of merging individual models built on different training data sets to obtain a single model that both performs well across all data set domains and generalizes to out-of-domain data. We propose a dataless knowledge fusion method that merges models in their parameter space, guided by weights that minimize prediction differences between the merged model and the individual models. Over a battery of evaluation settings, we show that the proposed method significantly outperforms baselines such as Fisher-weighted averaging or model ensembling. Further, we find that our method is a promising alternative to multi-task learning that can preserve or sometimes improve over the performance of the individual models without access to the training data. Finally, model merging is more efficient than training a multi-task model, making it applicable to a wider set of scenarios.
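To make the idea of merging "guided by weights that minimize prediction differences" more concrete, here is a minimal sketch for a single linear layer. It is an illustration under stated assumptions, not necessarily the paper's exact procedure. It assumes each model `i` has a weight matrix `W_i` and that the Gram matrix `X_i^T X_i` of its layer inputs is available in place of the raw data. The helper names `merge_linear_layers` and `prediction_gap` and the toy data are hypothetical. Under these assumptions, the least-squares objective `sum_i ||X_i W - X_i W_i||^2` has a closed-form minimizer.

```python
import numpy as np

def merge_linear_layers(weights, inputs):
    """Merge linear weights W_i (d_in x d_out) by minimizing
    sum_i ||X_i W - X_i W_i||^2 over W.

    Assumption: only the Gram matrices X_i^T X_i are needed, so the raw
    inputs are used here solely to compute those statistics.
    """
    grams = [x.T @ x for x in inputs]                  # per-model input statistics
    lhs = sum(grams)                                   # sum_i X_i^T X_i
    rhs = sum(g @ w for g, w in zip(grams, weights))   # sum_i X_i^T X_i W_i
    return np.linalg.solve(lhs, rhs)                   # closed-form least-squares solution

def prediction_gap(merged, weights, inputs):
    """Total squared difference between merged and individual layer outputs."""
    return sum(np.sum((x @ merged - x @ w) ** 2) for x, w in zip(inputs, weights))

# Toy data (hypothetical): two "models" whose inputs have different scales.
rng = np.random.default_rng(0)
inputs = [rng.normal(size=(50, 4)), 3.0 * rng.normal(size=(50, 4))]
weights = [rng.normal(size=(4, 2)), rng.normal(size=(4, 2))]

merged = merge_linear_layers(weights, inputs)
averaged = sum(weights) / len(weights)                 # simple averaging baseline

gap_merged = prediction_gap(merged, weights, inputs)
gap_averaged = prediction_gap(averaged, weights, inputs)
```

Because `merged` is the minimizer of the least-squares objective, `gap_merged` can never exceed `gap_averaged`. When all Gram matrices are identical, the solution reduces to plain weight averaging.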
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Classification | CIFAR-100 | Accuracy | 82.59 | 691 |
| Image Classification | Stanford Cars | Accuracy | 70.8 | 660 |
| Image Classification | DTD | Accuracy | 71.76 | 599 |
| Natural Language Inference | RTE | Accuracy | 81.2 | 590 |
| Image Classification | Food-101 | Accuracy | 76.14 | 570 |
| Image Classification | EuroSAT | Accuracy | 78.6 | 569 |
| Natural Language Understanding | GLUE | SST-2 | 90.6 | 551 |
| Classification | Cars | Accuracy | 26.68 | 492 |
| Image Classification | DTD | Accuracy | 52 | 487 |
| Image Classification | RESISC45 | Accuracy | 88.7 | 472 |