Low-Rank Few-Shot Adaptation of Vision-Language Models

About

Recent progress in the few-shot adaptation of Vision-Language Models (VLMs) has further pushed their generalization capabilities, at the expense of just a few labeled samples within the target downstream task. However, this promising, already quite abundant few-shot literature has focused principally on prompt learning and, to a lesser extent, on adapters, overlooking the recent advances in Parameter-Efficient Fine-Tuning (PEFT). Furthermore, existing few-shot learning methods for VLMs often rely on heavy training procedures and/or carefully chosen, task-specific hyper-parameters, which might impede their applicability. In response, we introduce Low-Rank Adaptation (LoRA) in few-shot learning for VLMs, and show its potential on 11 datasets, in comparison to current state-of-the-art prompt- and adapter-based approaches. Surprisingly, our simple CLIP-LoRA method exhibits substantial improvements, while reducing the training times and keeping the same hyper-parameters in all the target tasks, i.e., across all the datasets and numbers of shots. Certainly, our surprising results do not dismiss the potential of prompt-learning and adapter-based research. However, we believe that our strong baseline could be used to evaluate progress in these emergent subjects in few-shot VLMs.

Maxime Zanella, Ismail Ben Ayed• 2024

Related benchmarks

Task	Dataset	Result
Image Classification	Stanford Cars	Accuracy86	660
Image Classification	EuroSAT	--	569
Image Classification	UCF101	Top-1 Acc86.2	527
Classification	Cars	Accuracy86	492
Image Classification	Food101	Accuracy85.1	457
Image Classification	SUN397	Accuracy76	450
Image Classification	Oxford-IIIT Pets	Accuracy92.3	378
Image Classification	ImageNet	Top-1 Accuracy73.4	366
Image Classification	Pets	Accuracy91.9	308
Image Classification	Oxford Flowers 102	Accuracy97.9	234

Showing 10 of 92 rows

...

Other info

Follow for update

@wizwand_team Discord