
CLIP-SVD: Efficient and Interpretable Vision-Language Adaptation via Singular Values

About

Vision-language models (VLMs) like CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult due to reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present CLIP-SVD, a multi-modal and parameter-efficient adaptation framework that applies Singular Value Fine-tuning (SVF) to CLIP, leveraging Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design enables enhanced adaptation performance using only 0.04% of the model's total parameters and better preservation of its generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we leverage a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation to allow interpretability of CLIP-SVD. Overall, this work provides the first extensive empirical evaluation of SVD-based fine-tuning in the vision-language model setting. The code and biomedical corpus are publicly available at https://github.com/HealthX-Lab/CLIP-SVD.
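
The core idea in the abstract, tuning only the singular values of a weight matrix while keeping its singular vectors frozen, can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the linked repository for that); the function names, the 64x32 matrix size, the toy target, and the squared-error loss are all assumptions chosen for illustration.

```python
import numpy as np


def svd_factorize(weight):
    """Split a weight matrix into frozen bases (u, vt) and singular values s."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return u, s, vt


def rebuild(u, s, vt):
    """Recompose the weight matrix from the frozen bases and the current singular values."""
    return (u * s) @ vt


def singular_value_grad(u, vt, grad_weight):
    """Map dL/dW to dL/ds, using dL/ds_i = u_i^T (dL/dW) v_i."""
    return np.einsum("ji,jk,ik->i", u, grad_weight, vt)


# Stand-in for one pretrained weight matrix (hypothetical size).
rng = np.random.default_rng(0)
weight = rng.standard_normal((64, 32))
u, s, vt = svd_factorize(weight)

# Toy adaptation target (assumption): same bases, singular values rescaled by 1.5.
target = rebuild(u, 1.5 * s, vt)

# Gradient descent on the singular values only; u and vt never change.
s_tuned = s.copy()
for _ in range(50):
    grad_weight = rebuild(u, s_tuned, vt) - target  # dL/dW for 0.5 * ||W - target||_F^2
    s_tuned -= 0.5 * singular_value_grad(u, vt, grad_weight)

# 32 trainable values versus 2048 entries in the full matrix.
print(s.size, weight.size)
```

In this toy setting the trainable parameter count per matrix drops from rows x columns to min(rows, columns). The 0.04% figure quoted in the abstract refers to the full CLIP model and is not reproduced by this sketch.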

Taha Koleilat, Hassan Rivaz, Yiming Xiao • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Image Classification | DTD | Accuracy 45.15 | 599 |
| Image Classification | SUN397 | Accuracy 67.74 | 450 |
| Image Classification | StanfordCars | Accuracy 65 | 384 |
| Image Classification | ImageNet | Top-1 Accuracy 72.15 | 343 |
| Image Classification | Aircraft | Accuracy 26.03 | 340 |
| Image Classification | OxfordPets | Accuracy 91.06 | 298 |
| Image Classification | Food101 | Accuracy 86.21 | 177 |
| Image Classification | Flowers102 | Accuracy 72.45 | 88 |
| Image Classification | Average of 11 datasets (ImageNet, Caltech101, OxfordPets, StanfordCars, Flowers102, Food101, FGVCAircraft, SUN397, DTD, EuroSAT, UCF101), Base-to-Novel Generalization | Harmonic Mean (HM) 80.13 | 68 |
| Image Classification | UCF101 | Accuracy 69.91 | 54 |
Showing 10 of 22 rows
