MERGETUNE: Continued Fine-Tuning of Vision-Language Models

About

Fine-tuning vision-language models (VLMs) such as CLIP often leads to catastrophic forgetting of pretrained knowledge. Prior work primarily aims to mitigate forgetting during adaptation; however, forgetting often remains inevitable during this process. We introduce a novel paradigm, continued fine-tuning (CFT), which seeks to recover pretrained knowledge after a zero-shot model has already been adapted. We propose a simple, model-agnostic CFT strategy (named MERGETUNE) guided by linear mode connectivity (LMC), which can be applied post hoc to existing fine-tuned models without requiring architectural changes. Given a fine-tuned model, we continue fine-tuning its trainable parameters (e.g., soft prompts or linear heads) to search for a continued model which has two low-loss paths to the zero-shot (e.g., CLIP) and the fine-tuned (e.g., CoOp) solutions. By exploiting the geometry of the loss landscape, the continued model implicitly merges the two solutions, restoring pretrained knowledge lost in the fine-tuned counterpart. A challenge is that the vanilla LMC constraint requires data replay from the pretraining task. We approximate this constraint for the zero-shot model via a second-order surrogate, eliminating the need for large-scale data replay. Experiments show that MERGETUNE improves the harmonic mean of CoOp by +5.6% on base-novel generalisation without adding parameters. On robust fine-tuning evaluations, the LMC-merged model from MERGETUNE surpasses ensemble baselines with lower inference cost, achieving further gains and state-of-the-art results when ensembled with the zero-shot model. Our code is available at https://github.com/Surrey-UP-Lab/MERGETUNE.
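Two quantities in the abstract can be illustrated with a small sketch: the loss barrier along a straight line between two solutions, which is the usual way to test linear mode connectivity, and the harmonic mean of base and novel accuracy. This is not the MERGETUNE implementation. The function names, the toy loss functions, and the accuracy values are all hypothetical and chosen only for illustration, and the sketch does not cover the second-order surrogate.

```python
from typing import Callable, List, Sequence

Params = List[float]
LossFn = Callable[[Sequence[float]], float]


def interpolate(theta_a: Sequence[float], theta_b: Sequence[float], alpha: float) -> Params:
    """Point on the straight line between two parameter vectors."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(loss_fn: LossFn, theta_a: Sequence[float],
                 theta_b: Sequence[float], steps: int = 11) -> float:
    """Largest excess of the path loss over the linearly interpolated endpoint losses.

    A barrier near zero suggests the two solutions are linearly mode connected.
    """
    alphas = [i / (steps - 1) for i in range(steps)]
    losses = [loss_fn(interpolate(theta_a, theta_b, a)) for a in alphas]
    start, end = losses[0], losses[-1]
    return max(l - ((1.0 - a) * start + a * end) for a, l in zip(alphas, losses))


def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of base-class and novel-class accuracy."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2.0 * base_acc * novel_acc / (base_acc + novel_acc)


# Toy losses (assumptions, not from the paper).
def convex_loss(theta: Sequence[float]) -> float:
    return sum(x * x for x in theta)


def double_well_loss(theta: Sequence[float]) -> float:
    # Two minima at x = -1 and x = +1, separated by a bump at x = 0.
    return (theta[0] ** 2 - 1.0) ** 2


connected = loss_barrier(convex_loss, [-1.0, 2.0], [3.0, 0.5])  # no barrier on a convex loss
blocked = loss_barrier(double_well_loss, [-1.0], [1.0])         # barrier at the midpoint
h = harmonic_mean(80.0, 60.0)                                   # made-up accuracies
```

The harmonic mean is dominated by the weaker of the two accuracies, so a method that gains on base classes while losing on novel classes is penalised. That makes it a natural summary metric for measuring how much pretrained knowledge is retained after fine-tuning.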

Wenqing Wang, Da Li, Xiatian Zhu, Josef Kittler • 2026

Related benchmarks

Task                         Dataset                            Metric                 Result   Rank
Image Classification         ImageNet V2                        -                      -        487
Image Classification         ObjectNet                          -                      -        177
Image Classification         ImageNet Rendition                 Top-1 Accuracy         78.68    77
Image Classification         ImageNet-Sketch                    Accuracy               53.1     77
Base-to-New Generalization   DTD                                Base Accuracy          86.77    68
Base-to-New Generalization   FGVCAircraft                       Base Performance       49.82    64
Base-to-New Generalization   UCF101                             Base Accuracy          89.99    57
Image Classification         ImageNet and Distribution Shifts   ImageNet-V2 Accuracy   67.02    49
Base-to-New Generalization   OxfordPets                         Base Score             96.63    48
Base-to-New Generalization   Caltech101                         Base Score             98.93    44
Showing 10 of 20 rows
