Low-Rank Few-Shot Adaptation of Vision-Language Models
About
Recent progress in the few-shot adaptation of Vision-Language Models (VLMs) has further pushed their generalization capabilities, at the expense of just a few labeled samples within the target downstream task. However, this promising, already quite abundant few-shot literature has focused principally on prompt learning and, to a lesser extent, on adapters, overlooking the recent advances in Parameter-Efficient Fine-Tuning (PEFT). Furthermore, existing few-shot learning methods for VLMs often rely on heavy training procedures and/or carefully chosen, task-specific hyper-parameters, which might impede their applicability. In response, we introduce Low-Rank Adaptation (LoRA) in few-shot learning for VLMs, and show its potential on 11 datasets, in comparison to current state-of-the-art prompt- and adapter-based approaches. Surprisingly, our simple CLIP-LoRA method exhibits substantial improvements, while reducing the training times and keeping the same hyper-parameters in all the target tasks, i.e., across all the datasets and numbers of shots. Certainly, our surprising results do not dismiss the potential of prompt-learning and adapter-based research. However, we believe that our strong baseline could be used to evaluate progress in these emergent subjects in few-shot VLMs.
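As a rough illustration of the low-rank idea the abstract refers to, the sketch below keeps a weight matrix W frozen and learns only a rank-r update, so the effective weight is W + (alpha / r) · B · A. This is a minimal pure-Python sketch under stated assumptions, not the authors' CLIP-LoRA implementation. The class name `LoRALinear`, the `alpha` scaling, the initialization scales, and the toy dimensions are all hypothetical, and the text above does not say which CLIP matrices are adapted or at what rank.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A.

    Hypothetical illustration only: names and defaults are assumptions,
    not taken from the paper or from any library.
    """

    def __init__(self, d_in, d_out, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Pretrained weight: stays frozen during adaptation.
        self.weight = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable factor A (rank x d_in), small random init.
        self.lora_a = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        # Trainable factor B (d_out x rank), zero init so the update starts at zero.
        self.lora_b = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def effective_weight(self):
        """Return W + scale * (B @ A)."""
        delta = matmul(self.lora_b, self.lora_a)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to a batch of row vectors (batch x d_in)."""
        w = self.effective_weight()
        return [[sum(xi * wi for xi, wi in zip(x_row, w_row)) for w_row in w] for x_row in x]

    def trainable_params(self):
        """Count only the entries of A and B, since W is frozen."""
        rank, d_in = len(self.lora_a), len(self.lora_a[0])
        d_out = len(self.lora_b)
        return rank * d_in + d_out * rank


demo = LoRALinear(d_in=64, d_out=64, rank=2)
print("full fine-tuning parameters:", 64 * 64)            # 4096
print("LoRA trainable parameters:", demo.trainable_params())  # 256
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training only has to move the two small factors. In this toy setting that is 256 values instead of 4096.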
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Classification | EuroSAT | -- | -- | 497 |
| Image Classification | UCF101 | Top-1 Acc | 84.1 | 404 |
| Image Classification | Oxford-IIIT Pets | Accuracy | 92.3 | 259 |
| Image Classification | FGVC Aircraft | Top-1 Accuracy | 46.2 | 185 |
| Image Classification | Caltech-101 | Top-1 Accuracy | 95.8 | 146 |
| Image Classification | 11 Downstream Classification Datasets (ImageNet, Flowers102, DTD, OxfordPets, StanfordCars, UCF101, Caltech101, Food101, SUN397, FGVC-Aircraft, EuroSAT) standard (test) | DTD Accuracy | 73.9 | 115 |
| Image Classification | Oxford 102 Flowers | Top-1 Accuracy | 96.3 | 68 |
| Image Classification | Average 11 datasets | -- | -- | 52 |
| Few-shot Image Classification | all-to-all setting | Accuracy | 83.5 | 31 |
| Few-shot Image Classification | FGVC-Aircraft (test) | Top-1 Accuracy | 54.7 | 31 |