MetaTPT: Meta Test-time Prompt Tuning for Vision-Language Models
About
Vision-language models (VLMs) such as CLIP exhibit strong zero-shot generalization but remain sensitive to domain shifts at test time. Test-time prompt tuning (TPT) mitigates this issue by adapting prompts with fixed augmentations, which may falter in more challenging settings. In this work, we propose Meta Test-Time Prompt Tuning (MetaTPT), a meta-learning framework that learns a self-supervised auxiliary task to guide test-time prompt tuning. The auxiliary task dynamically learns parameterized augmentations for each sample, enabling more expressive transformations that capture essential features in target domains. MetaTPT adopts a dual-loop optimization paradigm: an inner loop learns a self-supervised task that generates informative views, while the outer loop performs prompt tuning by enforcing consistency across these views. By coupling augmentation learning with prompt tuning, MetaTPT improves test-time adaptation under domain shifts. Extensive experiments demonstrate that MetaTPT achieves state-of-the-art performance on domain generalization and cross-dataset benchmarks.
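The dual-loop idea above can be sketched on a toy problem. The code below is an illustrative assumption, not the paper's implementation: `views`, `inner_loss`, and `outer_loss` are hypothetical stand-ins for the learned parameterized augmentation, the self-supervised auxiliary task, and the consistency objective over prompted predictions, and gradients are taken by finite differences instead of autodiff.

```python
import numpy as np

def num_grad(f, p, eps=1e-5):
    """Finite-difference gradient of a scalar function f at parameter vector p."""
    g = np.zeros_like(p)
    for i in range(p.size):
        e = np.zeros_like(p)
        e[i] = eps
        g[i] = (f(p + e) - f(p - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x = rng.normal(size=4)        # a single "test sample" (toy feature vector)
aug = np.full(4, 0.5)         # per-sample parameterized augmentation (inner loop)
prompt = np.zeros(4)          # prompt parameters tuned at test time (outer loop)

def views(aug):
    """Two augmented views generated by the learned augmentation parameters."""
    return x + aug, x * (1.0 + aug)

def predict(prompt, v):
    """Toy 'prompted' classifier score for a view (stand-in for CLIP logits)."""
    return 1.0 / (1.0 + np.exp(-np.dot(prompt + x, v)))

def inner_loss(aug):
    """Toy self-supervised auxiliary objective: keep views faithful to the sample."""
    v1, v2 = views(aug)
    return np.mean((v1 - x) ** 2) + np.mean((v2 - x) ** 2)

def outer_loss(prompt, aug):
    """Consistency of prompted predictions across the generated views."""
    v1, v2 = views(aug)
    return (predict(prompt, v1) - predict(prompt, v2)) ** 2

# Dual-loop optimization: the inner loop adapts the augmentation via the
# self-supervised task; the outer loop tunes the prompt for cross-view consistency.
for step in range(50):
    aug = aug - 0.1 * num_grad(inner_loss, aug)
    prompt = prompt - 0.1 * num_grad(lambda p: outer_loss(p, aug), prompt)
```

After a few dozen steps the augmentation and prompt jointly drive the cross-view consistency loss toward zero; in the real method both loops would instead backpropagate through the VLM's prompted predictions.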
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | DTD | Accuracy | 48.88 | 419 |
| Image Classification | UCF101 | Top-1 Accuracy | 72.24 | 404 |
| Image Classification | Food101 | Accuracy | 87.61 | 309 |
| Image Classification | StanfordCars | Accuracy | 69.5 | 266 |
| Image Classification | SUN397 | Accuracy | 69.17 | 246 |
| Image Classification | FGVCAircraft | Accuracy | 29.05 | 225 |
| Image Classification | Caltech101 | Accuracy | 94.9 | 162 |
| Image Classification | OxfordPets | Accuracy | 92.79 | 113 |
| Image Classification | EuroSAT | Accuracy | 54.26 | 83 |
| Image Classification | Oxford Flowers | Top-1 Accuracy | 74.22 | 78 |