Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning
About
Few-shot recognition (FSR) aims to train a classification model with only a few labeled examples of each concept of interest in a downstream task, where data annotation cost can be prohibitively high. We develop methods to solve FSR by leveraging a pretrained Vision-Language Model (VLM). We particularly explore retrieval-augmented learning (RAL), which retrieves open data, e.g., the VLM's pretraining dataset, to learn models that better serve downstream tasks. RAL has been studied in zero-shot recognition but remains under-explored in FSR. Although applying RAL to FSR may seem straightforward, we observe interesting and novel challenges and opportunities. First, somewhat surprisingly, finetuning a VLM on a large amount of retrieved data underperforms state-of-the-art zero-shot methods. This is due to the imbalanced distribution of the retrieved data and its domain gap with the few-shot examples in the downstream task. Second, more surprisingly, we find that simply finetuning a VLM solely on few-shot examples significantly outperforms previous FSR methods, and finetuning on the mix of retrieved and few-shot data yields even better results. Third, to mitigate the imbalanced distribution and domain gap issues, we propose Stage-Wise retrieval-Augmented fineTuning (SWAT), which involves end-to-end finetuning on mixed data in the first stage and retraining the classifier on the few-shot data in the second stage. Extensive experiments on nine popular benchmarks demonstrate that SWAT outperforms previous methods by more than 6% in accuracy.
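The two-stage schedule described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a 2x2 linear map standing in for the VLM backbone, a logistic-regression head standing in for the classifier, and synthetic 2-D points standing in for images. All names (`train`, `swat_sketch`, `make_points`) and hyperparameters are hypothetical. The abstract does not say whether the classifier is re-initialized before stage 2, so this sketch assumes it is.

```python
import math
import random


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def forward(backbone, head, x):
    # backbone: 2x2 matrix (toy stand-in for a VLM encoder).
    # head: [w0, w1, bias] (toy stand-in for the classifier).
    z = [sum(backbone[i][j] * x[j] for j in range(2)) for i in range(2)]
    return z, sigmoid(head[0] * z[0] + head[1] * z[1] + head[2])


def train(backbone, head, data, epochs, lr, update_backbone):
    # Plain SGD on the logistic loss; the backbone is only
    # updated when update_backbone is True.
    for _ in range(epochs):
        for x, y in data:
            z, p = forward(backbone, head, x)
            g = p - y  # d(loss)/d(logit)
            if update_backbone:
                for i in range(2):
                    for j in range(2):
                        backbone[i][j] -= lr * g * head[i] * x[j]
            for i in range(2):
                head[i] -= lr * g * z[i]
            head[2] -= lr * g


def make_points(rng, center, label, n):
    # Synthetic 2-D samples around a class center.
    return [([rng.gauss(center[0], 0.5), rng.gauss(center[1], 0.5)], label)
            for _ in range(n)]


def swat_sketch(seed=0):
    rng = random.Random(seed)
    # Balanced few-shot set: 4 examples per class.
    few_shot = (make_points(rng, (-1.0, -1.0), 0, 4)
                + make_points(rng, (1.0, 1.0), 1, 4))
    # "Retrieved" set: larger, class-imbalanced, with shifted centers
    # to mimic a domain gap.
    retrieved = (make_points(rng, (-1.5, -0.5), 0, 40)
                 + make_points(rng, (1.5, 0.5), 1, 5))

    backbone = [[1.0, 0.0], [0.0, 1.0]]
    head = [0.0, 0.0, 0.0]

    # Stage 1: end-to-end finetuning on the mix of retrieved and few-shot data.
    mixed = retrieved + few_shot
    rng.shuffle(mixed)
    train(backbone, head, mixed, epochs=20, lr=0.05, update_backbone=True)
    stage1_backbone = [row[:] for row in backbone]

    # Stage 2: freeze the backbone and retrain the classifier on the
    # few-shot data only (re-initialization is an assumption of this sketch).
    head = [0.0, 0.0, 0.0]
    train(backbone, head, few_shot, epochs=50, lr=0.05, update_backbone=False)
    return stage1_backbone, backbone, head, few_shot


if __name__ == "__main__":
    s1, s2, head, few_shot = swat_sketch()
    print("backbone unchanged in stage 2:", s1 == s2)
    print("retrained head:", head)
```

The point of the split is that stage 1 lets the feature extractor benefit from the large retrieved set, while stage 2 refits only the decision boundary on balanced, in-domain examples. That is where the imbalance and domain gap of the retrieved data would otherwise bias the classifier.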
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | Fungi | Accuracy: 29.2 | 25 |
| Few-shot Image Classification | Aves | Accuracy: 58.2 | 22 |
| Fine-grained species classification | Fungi FungiTastic 16-shot (test) | Accuracy: 29.9 | 18 |
| Fine-grained species classification | Insecta Species196 16-shot (test) | Accuracy: 63.8 | 18 |
| Image Classification | Five Datasets 4-shot | Accuracy: 0.674 | 18 |
| Image Classification | Five Datasets 8-shot | Accuracy: 71 | 18 |
| Image Classification | Five Datasets 16-shot | Accuracy: 74 | 18 |
| Fine-grained species classification | Mollusca Species196 16-shot (test) | Accuracy: 63.6 | 18 |
| Fine-grained species classification | Weeds Species196 16-shot (test) | Accuracy: 80.7 | 18 |
| Fine-grained species classification | iNaturalist Aves 16-shot 2018 (test) | Accuracy: 58.2 | 18 |