A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models
About
Efficient transfer learning (ETL) is receiving increasing attention as a way to adapt large pre-trained vision-language models to downstream tasks with only a few labeled samples. While significant progress has been made, we reveal that state-of-the-art ETL approaches exhibit strong performance only in narrowly defined experimental setups, and only with a careful adjustment of hyperparameters based on a large corpus of labeled samples. In particular, we make two interesting and surprising empirical observations. First, to outperform a simple Linear Probing baseline, these methods must optimize their hyperparameters on each target task. Second, they typically underperform -- sometimes dramatically -- standard zero-shot predictions in the presence of distributional drifts. Motivated by the unrealistic assumptions made in the existing literature, i.e., access to a large validation set and a case-specific grid search for optimal hyperparameters, we propose a novel approach that meets the requirements of real-world scenarios. More concretely, we introduce a CLass-Adaptive linear Probe (CLAP) objective, whose balancing term is optimized via an adaptation of the general Augmented Lagrangian method tailored to this context. We comprehensively evaluate CLAP on a broad span of datasets and scenarios, demonstrating that it consistently outperforms state-of-the-art (SoTA) approaches while being a much more efficient alternative.
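To make the idea of a class-adaptive linear probe more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the balancing term is a per-class quadratic penalty that keeps each learned class prototype close to its zero-shot text prototype; the abstract does not state this form, so treat it as an illustration only. The per-class weights `lam` are fixed by hand here, so the Augmented Lagrangian update mentioned in the abstract is deliberately left out. All names (`penalized_linear_probe`, `zs_protos`, `lam`, `temp`) and the toy data are hypothetical.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with max subtraction for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def penalized_linear_probe(feats, labels, zs_protos, lam,
                           lr=0.1, steps=200, temp=10.0):
    """Linear probe with a class-wise penalty toward zero-shot prototypes.

    feats:     (N, D) image features
    labels:    (N,)   integer class labels
    zs_protos: (C, D) zero-shot class prototypes (assumed given)
    lam:       (C,)   per-class penalty weights (fixed here by assumption)
    """
    W = zs_protos.copy()                      # start from the zero-shot solution
    onehot = np.eye(zs_protos.shape[0])[labels]
    n = feats.shape[0]
    for _ in range(steps):
        probs = softmax(temp * feats @ W.T)
        # Gradient of the mean cross-entropy with respect to W.
        grad_ce = temp * (probs - onehot).T @ feats / n
        # Gradient of sum_c lam_c * ||W_c - zs_protos_c||^2.
        grad_pen = 2.0 * lam[:, None] * (W - zs_protos)
        W -= lr * (grad_ce + grad_pen)
    return W

# Toy data (hypothetical): two classes whose zero-shot prototypes
# initially misclassify both support samples.
feats = np.array([[0.6, 0.8], [0.8, 0.6]])
labels = np.array([0, 1])
zs_protos = np.array([[1.0, 0.0], [0.0, 1.0]])

W_free = penalized_linear_probe(feats, labels, zs_protos, lam=np.zeros(2))
W_pen = penalized_linear_probe(feats, labels, zs_protos, lam=np.full(2, 0.1))
```

On this toy problem the penalized prototypes stay closer to the zero-shot ones than the unpenalized prototypes do, which is the trade-off a class-wise balancing term is meant to control.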
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet (INet) | Accuracy | 65 | 50 |
| Few-shot Image Classification | 11 datasets average CLIP-based (ImageNet, Caltech101, OxfordPets, StanfordCars, Flowers102, Food101, FGVCAircraft, SUN397, DTD, EuroSAT, UCF101) | Accuracy | 74.57 | 30 |
| Image Classification | ImageNet 1k (source) | Top-1 Acc | 73.38 | 28 |
| Few-shot Image Classification | Aves | Accuracy | 53.6 | 22 |
| Image Classification | ImageNet Distribution Shifts Average of ImageNet-V2, ImageNet-R, ImageNet-Sketch, ObjectNet, and ImageNet-A (test) | Average Accuracy | 60.04 | 19 |
| Fine-grained species classification | Insecta Species196 16-shot (test) | Accuracy | 63.1 | 18 |
| Fine-grained species classification | Fungi FungiTastic 16-shot (test) | Accuracy | 24.9 | 18 |
| Fine-grained species classification | Mollusca Species196 16-shot (test) | Accuracy | 63.5 | 18 |
| Fine-grained species classification | Weeds Species196 16-shot (test) | Accuracy | 76.9 | 18 |
| Image Classification | Five Datasets 8-shot | Accuracy | 61.3 | 18 |