MERGETUNE: Continued Fine-Tuning of Vision-Language Models
About
Fine-tuning vision-language models (VLMs) such as CLIP often leads to catastrophic forgetting of pretrained knowledge. Prior work primarily aims to mitigate forgetting during adaptation; however, some forgetting often remains unavoidable. We introduce a novel paradigm, continued fine-tuning (CFT), which seeks to recover pretrained knowledge after a zero-shot model has already been adapted. We propose a simple, model-agnostic CFT strategy (named MERGETUNE) guided by linear mode connectivity (LMC), which can be applied post hoc to existing fine-tuned models without requiring architectural changes. Given a fine-tuned model, we continue fine-tuning its trainable parameters (e.g., soft prompts or linear heads) to search for a continued model that has low-loss paths to both the zero-shot (e.g., CLIP) and the fine-tuned (e.g., CoOp) solutions. By exploiting the geometry of the loss landscape, the continued model implicitly merges the two solutions, restoring pretrained knowledge lost in the fine-tuned counterpart. A challenge is that the vanilla LMC constraint requires data replay from the pretraining task. We approximate this constraint for the zero-shot model via a second-order surrogate, eliminating the need for large-scale data replay. Experiments show that MERGETUNE improves the harmonic mean of CoOp by +5.6% on base-novel generalisation without adding parameters. On robust fine-tuning evaluations, the LMC-merged model from MERGETUNE surpasses ensemble baselines at lower inference cost, and achieves further gains and state-of-the-art results when ensembled with the zero-shot model. Our code is available at https://github.com/Surrey-UP-Lab/MERGETUNE.
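To make the notion of a "low-loss path" concrete, the sketch below measures the loss barrier along the straight line between two parameter vectors, which is one common way to quantify linear mode connectivity. It is an illustration only and is not the MERGETUNE objective or its second-order surrogate: the function names `interpolate` and `loss_barrier`, the barrier definition, and the two toy losses are all assumptions made for this example.

```python
def interpolate(theta_a, theta_b, alpha):
    """Return the point (1 - alpha) * theta_a + alpha * theta_b."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(loss_fn, theta_a, theta_b, steps=11):
    """Largest excess of the path loss over the interpolated endpoint losses.

    A barrier close to zero suggests the two solutions are linearly
    mode connected under `loss_fn` (hypothetical helper, toy definition).
    """
    loss_a, loss_b = loss_fn(theta_a), loss_fn(theta_b)
    barrier = 0.0
    for i in range(steps):
        alpha = i / (steps - 1)
        path_loss = loss_fn(interpolate(theta_a, theta_b, alpha))
        baseline = (1 - alpha) * loss_a + alpha * loss_b
        barrier = max(barrier, path_loss - baseline)
    return barrier


def bowl(theta):
    """Convex toy loss: no barrier between any two points."""
    return sum(x * x for x in theta)


def double_well(theta):
    """Toy loss with two minima at -1 and +1 separated by a ridge at 0."""
    return (theta[0] ** 2 - 1) ** 2


connected = loss_barrier(bowl, [-1.0, 0.0], [1.0, 0.0])
separated = loss_barrier(double_well, [-1.0], [1.0])
print(connected, separated)
```

In the convex bowl the barrier is zero, while the double-well loss shows a barrier of height 1 at the midpoint. In the setting described above, a continued model would be one whose barriers to both the zero-shot and the fine-tuned solutions are small.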
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet V2 | -- | -- | 487 |
| Image Classification | ObjectNet | -- | -- | 177 |
| Image Classification | ImageNet Rendition | Top-1 Accuracy | 78.68 | 77 |
| Image Classification | ImageNet-Sketch | Accuracy | 53.1 | 77 |
| Base-to-New Generalization | DTD | Base Accuracy | 86.77 | 68 |
| Base-to-New Generalization | FGVCAircraft | Base Performance | 49.82 | 64 |
| Base-to-New Generalization | UCF101 | Base Accuracy | 89.99 | 57 |
| Image Classification | ImageNet and Distribution Shifts | ImageNet-V2 Accuracy | 67.02 | 49 |
| Base-to-New Generalization | OxfordPets | Base Score | 96.63 | 48 |
| Base-to-New Generalization | Caltech101 | Base Score | 98.93 | 44 |
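The base-to-new rows list base-class results only, whereas the abstract reports the harmonic mean over base and novel classes. As a reminder of how that summary metric is computed, here is a minimal sketch. The function name `harmonic_mean` is hypothetical, and the two input accuracies are placeholder values chosen for illustration, not results from the paper.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy (in percent)."""
    total = base_acc + novel_acc
    if total == 0:
        return 0.0  # avoid division by zero when both accuracies are 0
    return 2 * base_acc * novel_acc / total


# Placeholder accuracies for illustration only (not reported results).
h = harmonic_mean(80.0, 60.0)
print(f"{h:.2f}")  # prints 68.57
```

Because the harmonic mean is pulled toward the smaller of the two values, a model that gains base accuracy by sacrificing novel accuracy scores lower than its arithmetic mean would suggest.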