Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models
About
Prompt tuning, a recently emerging paradigm, enables powerful vision-language pre-training models to adapt to downstream tasks in a parameter- and data-efficient way by learning "soft prompts" that condition the frozen pre-training models. Though effective, it is particularly problematic in the few-shot scenario, where prompt tuning performance is sensitive to the initialization and finding a good initialization is time-consuming, which restricts the fast adaptation ability of the pre-training models. In addition, prompt tuning can undermine the generalizability of the pre-training models, because the learnable prompt tokens are prone to overfitting the limited training samples. To address these issues, we introduce a novel Gradient-RegulAted Meta-prompt learning (GRAM) framework that jointly meta-learns an efficient soft prompt initialization for better adaptation and a lightweight gradient regulating function for strong cross-domain generalizability, using only the unlabeled image-text pre-training data. Rather than designing a specific prompt tuning method, GRAM can be easily incorporated into various prompt tuning methods in a model-agnostic way, and comprehensive experiments show that GRAM brings consistent improvement for them in several settings (e.g., few-shot learning, cross-domain generalization, and cross-dataset generalization) across 11 datasets. Further, experiments show that GRAM enables the orthogonal methods of textual and visual prompt tuning to work in a mutually enhanced way, offering better generalizability than uni-modal prompt tuning methods.
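The general idea of jointly meta-learning an initialization and a gradient-regulating function can be illustrated with a toy sketch. This is not the GRAM algorithm: the abstract does not specify the form of the regulating function, so the element-wise sigmoid gate, the quadratic stand-in loss, the synthetic tasks, and all names below (`make_tasks`, `adapt`, `meta_train`) are assumptions made purely for illustration. No vision-language model is involved; a 4-dimensional vector plays the role of the soft prompt.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_tasks(n_tasks, mu, rng):
    """Hypothetical toy tasks. The first half of the dimensions carries genuine
    task variation; the second half of the support targets carries noise that
    the query targets do not share (a stand-in for few-shot overfitting)."""
    half = mu.size // 2
    query = np.tile(mu, (n_tasks, 1))
    query[:, :half] += rng.normal(size=(n_tasks, half))
    support = query.copy()
    support[:, half:] += rng.normal(size=(n_tasks, mu.size - half))
    return support, query

def adapt(prompt, gate, support, inner_lr=1.0):
    """One inner step on the toy loss 0.5 * ||prompt - support||^2,
    with the gradient scaled element-wise by the gate."""
    grad = prompt - support
    return prompt - inner_lr * gate * grad

def query_loss(adapted, query):
    return 0.5 * np.mean(np.sum((adapted - query) ** 2, axis=1))

def meta_train(support, query, epochs=500, meta_lr=0.5, inner_lr=1.0):
    dim = support.shape[1]
    prompt = np.zeros(dim)   # meta-learned prompt initialization
    logits = np.zeros(dim)   # gate parameters; sigmoid(0) = 0.5
    for _ in range(epochs):
        gate = sigmoid(logits)
        resid = adapt(prompt, gate, support, inner_lr) - query
        # Exact chain rule through the single inner step of the toy loss.
        g_prompt = np.mean((1.0 - inner_lr * gate) * resid, axis=0)
        g_logits = np.mean(
            resid * (-inner_lr) * (prompt - support) * gate * (1.0 - gate), axis=0
        )
        prompt -= meta_lr * g_prompt
        logits -= meta_lr * g_logits
    return prompt, sigmoid(logits)

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0, 2.0, 0.5])
train_s, train_q = make_tasks(20, mu, rng)
test_s, test_q = make_tasks(50, mu, rng)

prompt0, gate = meta_train(train_s, train_q)
meta = query_loss(adapt(prompt0, gate, test_s), test_q)
# Baseline: zero initialization and an unregulated gradient step.
plain = query_loss(adapt(np.zeros(4), np.ones(4), test_s), test_q)
print("learned gate:", np.round(gate, 3))
print("query loss (meta-learned):", meta, " (plain step):", plain)
```

In this toy setting the gate tends to stay open on the dimensions where the support gradient carries real task signal and to close on the dimensions where it carries only noise, which is one plausible reading of how a regulating function could limit overfitting during few-shot adaptation.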
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Classification | EuroSAT | Accuracy | 52.63 | 497 |
| Image Classification | Food-101 | Accuracy | 86.69 | 494 |
| Image Classification | DTD | Accuracy | 48.06 | 487 |
| Image Classification | Flowers102 | Accuracy | 73.12 | 478 |
| Image Classification | SUN397 | Accuracy | 67.97 | 425 |
| Image Classification | UCF101 | Top-1 Acc | 71.03 | 404 |
| Image Classification | ImageNet | Top-1 Accuracy | 71.65 | 324 |
| Image Classification | Aircraft | Accuracy | 25.27 | 302 |
| Image Classification | StanfordCars | Accuracy | 66.78 | 266 |
| Image Classification | Caltech101 | Base Accuracy | 98.07 | 129 |