MERGETUNE: Continued Fine-Tuning of Vision-Language Models

About

Fine-tuning vision-language models (VLMs) such as CLIP often leads to catastrophic forgetting of pretrained knowledge. Prior work primarily aims to mitigate forgetting during adaptation; however, forgetting often remains inevitable during this process. We introduce a novel paradigm, continued fine-tuning (CFT), which seeks to recover pretrained knowledge after a zero-shot model has already been adapted. We propose a simple, model-agnostic CFT strategy (named MERGETUNE) guided by linear mode connectivity (LMC), which can be applied post hoc to existing fine-tuned models without requiring architectural changes. Given a fine-tuned model, we continue fine-tuning its trainable parameters (e.g., soft prompts or linear heads) to search for a continued model which has two low-loss paths to the zero-shot (e.g., CLIP) and the fine-tuned (e.g., CoOp) solutions. By exploiting the geometry of the loss landscape, the continued model implicitly merges the two solutions, restoring pretrained knowledge lost in the fine-tuned counterpart. A challenge is that the vanilla LMC constraint requires data replay from the pretraining task. We approximate this constraint for the zero-shot model via a second-order surrogate, eliminating the need for large-scale data replay. Experiments show that MERGETUNE improves the harmonic mean of CoOp by +5.6% on base-novel generalisation without adding parameters. On robust fine-tuning evaluations, the LMC-merged model from MERGETUNE surpasses ensemble baselines with lower inference cost, achieving further gains and state-of-the-art results when ensembled with the zero-shot model. Our code is available at https://github.com/Surrey-UP-Lab/MERGETUNE.
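The linear mode connectivity idea in the abstract can be illustrated with a toy check: interpolate linearly between two parameter vectors and measure how far the loss along that path rises above the straight line joining the endpoint losses. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The names `interpolate`, `loss_barrier`, `convex_loss`, and `double_well_loss` are hypothetical, and the toy losses stand in for a real model's loss on data.

```python
import numpy as np


def interpolate(theta_a, theta_b, alpha):
    """Return the point at fraction alpha on the straight line from theta_a to theta_b."""
    return (1.0 - alpha) * theta_a + alpha * theta_b


def loss_barrier(loss_fn, theta_a, theta_b, num_points=11):
    """Largest excess of the path loss over the linearly interpolated endpoint losses.

    A value near zero suggests the two solutions are linearly mode connected
    (under this toy definition); a large value indicates a loss barrier.
    """
    alphas = np.linspace(0.0, 1.0, num_points)
    path = np.array([loss_fn(interpolate(theta_a, theta_b, a)) for a in alphas])
    baseline = (1.0 - alphas) * path[0] + alphas * path[-1]
    return float(np.max(path - baseline))


# Hypothetical toy losses (assumptions for illustration only).
def convex_loss(theta):
    # Convex bowl: the straight path between any two points has no barrier.
    return float(np.sum(theta ** 2))


def double_well_loss(theta):
    # Two minima at theta[0] = -1 and theta[0] = +1, separated by a bump at 0.
    return float((theta[0] ** 2 - 1.0) ** 2)


# Stand-ins for the "zero-shot" and "fine-tuned" solutions.
connected = loss_barrier(convex_loss, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
separated = loss_barrier(double_well_loss, np.array([-1.0]), np.array([1.0]))

print(f"barrier on convex loss:      {connected:.3f}")
print(f"barrier on double-well loss: {separated:.3f}")
```

In this toy setting the convex loss shows no barrier, while the double-well loss shows a clear one at the midpoint of the path. The abstract describes searching for a continued model that has low-loss paths to both the zero-shot and the fine-tuned solutions; a barrier measure of this kind is one simple way to probe whether such a path exists.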

Wenqing Wang, Da Li, Xiatian Zhu, Josef Kittler • 2026

Related benchmarks

Task	Dataset	Result
Image Classification	ImageNet V2	--	749
Image Classification	ObjectNet	Accuracy 56.84	251
Image Classification	ImageNet Rendition	Top-1 Accuracy 78.68	113
Base-to-New Generalization	DTD	Base Accuracy 86.77	94
Base-to-New Generalization	FGVCAircraft	Base Performance 49.82	90
Image Classification	ImageNet-Sketch	Accuracy 53.1	89
Base-to-New Generalization	OxfordPets	Base Score 96.63	76
Base-to-New Generalization	UCF101	Base Accuracy 89.99	71
Base-to-New Generalization	Caltech101	Base Score 98.93	70
Base-to-New Generalization	StanfordCars	Base Score 82.98	69

Showing 10 of 20 rows
