A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models
About
Large pre-trained vision-language (VL) models can learn a new task with a handful of examples and generalize to a new task without fine-tuning. However, these VL models are hard to deploy in real-world applications because of their impractically large size and slow inference speed. To address this limitation, we study prompt-based low-resource learning of VL tasks with our proposed method, FewVLM, which is relatively small compared with recent few-shot learners. For FewVLM, we pre-train a sequence-to-sequence transformer model with prefix language modeling (PrefixLM) and masked language modeling (MaskedLM). Furthermore, we analyze the effect of diverse prompts on few-shot tasks. Experimental results on VQA show that FewVLM with prompt-based learning outperforms Frozen, which is 31x larger than FewVLM, by 18.2 percentage points and achieves results comparable to those of PICa, a 246x larger model. In our analysis, we observe that (1) prompts significantly affect zero-shot performance but only marginally affect few-shot performance, (2) models with noisy prompts learn as quickly as models with hand-crafted prompts given larger training data, and (3) MaskedLM helps VQA tasks while PrefixLM boosts captioning performance. Our code is publicly available at https://github.com/woojeongjin/FewVLM
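To make the two pre-training objectives named above more concrete, the following is a minimal text-only sketch of how a PrefixLM pair and a MaskedLM pair might be built from a single caption. It is an illustration under stated assumptions, not the FewVLM implementation: the function names `prefix_lm_pair` and `masked_lm_pair` are hypothetical, the `<extra_id_N>` sentinel format is assumed to follow the T5 convention, the split point and masked positions are chosen by hand rather than sampled, and image features are left out entirely.

```python
from typing import List, Set, Tuple


def prefix_lm_pair(words: List[str], split: int) -> Tuple[str, str]:
    """PrefixLM sketch: the model reads a prefix and generates the rest.

    `split` is the number of words kept in the input prefix (hypothetical
    parameter; a real pipeline would typically sample it at random).
    """
    if not 0 < split < len(words):
        raise ValueError("split must leave both a prefix and a continuation")
    return " ".join(words[:split]), " ".join(words[split:])


def masked_lm_pair(words: List[str], masked: Set[int]) -> Tuple[str, str]:
    """MaskedLM sketch: masked spans become sentinels, the target restores them.

    Adjacent masked positions are merged into one span that shares a single
    sentinel token (assumed T5-style `<extra_id_N>` format).
    """
    source, target = [], []
    sentinel = 0
    prev_masked = False
    for i, word in enumerate(words):
        if i in masked:
            if not prev_masked:
                # Start of a new masked span: emit one sentinel on both sides.
                token = f"<extra_id_{sentinel}>"
                source.append(token)
                target.append(token)
                sentinel += 1
            target.append(word)
            prev_masked = True
        else:
            source.append(word)
            prev_masked = False
    return " ".join(source), " ".join(target)


caption = "a cat sitting on a mat".split()
prefix_source, prefix_target = prefix_lm_pair(caption, split=2)
masked_source, masked_target = masked_lm_pair(caption, masked={1, 4, 5})
```

In this sketch the PrefixLM target is a free-form continuation, which resembles caption generation, while the MaskedLM target fills in short missing spans, which resembles producing a short answer. That is one intuitive reading of observation (3), not a claim made by the abstract.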
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy 51.1 | 1165 |
| Visual Question Answering | GQA | Accuracy 35.7 | 963 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy 47.7 | 664 |
| Visual Question Answering | OK-VQA (test) | Accuracy 23.1 | 296 |
| 5-way Classification | miniImageNet (test) | -- | 231 |
| Visual Question Answering | GQA (test-dev) | Accuracy 29.3 | 178 |
| Visual Question Answering | VQAv2 | Accuracy 47.7 | 177 |
| Visual Question Answering | VQA 2.0 (val) | Accuracy (Overall) 51.1 | 143 |
| Image Captioning | Flickr30k (test) | CIDEr 37 | 103 |
| Image Captioning | NoCaps | CIDEr 47.7 | 101 |