MetaTPT: Meta Test-time Prompt Tuning for Vision-Language Models
About
Vision-language models (VLMs) such as CLIP exhibit strong zero-shot generalization but remain sensitive to domain shifts at test time. Test-time prompt tuning (TPT) mitigates this issue by adapting prompts with fixed augmentations, which may falter in more challenging settings. In this work, we propose Meta Test-Time Prompt Tuning (MetaTPT), a meta-learning framework that learns a self-supervised auxiliary task to guide test-time prompt tuning. The auxiliary task dynamically learns parameterized augmentations for each sample, enabling more expressive transformations that capture essential features in target domains. MetaTPT adopts a dual-loop optimization paradigm: an inner loop learns a self-supervised task that generates informative views, while the outer loop performs prompt tuning by enforcing consistency across these views. By coupling augmentation learning with prompt tuning, MetaTPT improves test-time adaptation under domain shifts. Extensive experiments demonstrate that MetaTPT achieves state-of-the-art performance on domain generalization and cross-dataset benchmarks.
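The dual-loop idea above can be sketched on a toy problem. The code below is an illustrative assumption, not the paper's implementation: `views`, `inner_loss`, and `outer_loss` are hypothetical stand-ins for the learned parameterized augmentation, the self-supervised auxiliary task, and the consistency objective over prompted predictions, and gradients are taken by finite differences instead of autodiff.

```python
import numpy as np

def num_grad(f, p, eps=1e-5):
    """Finite-difference gradient of a scalar function f at parameter vector p."""
    g = np.zeros_like(p)
    for i in range(p.size):
        e = np.zeros_like(p)
        e[i] = eps
        g[i] = (f(p + e) - f(p - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x = rng.normal(size=4)        # a single "test sample" (toy feature vector)
aug = np.full(4, 0.5)         # per-sample parameterized augmentation (inner loop)
prompt = np.zeros(4)          # prompt parameters tuned at test time (outer loop)

def views(aug):
    """Two augmented views generated by the learned augmentation parameters."""
    return x + aug, x * (1.0 + aug)

def predict(prompt, v):
    """Toy 'prompted' classifier score for a view (stand-in for CLIP logits)."""
    return 1.0 / (1.0 + np.exp(-np.dot(prompt + x, v)))

def inner_loss(aug):
    """Toy self-supervised auxiliary objective: keep views faithful to the sample."""
    v1, v2 = views(aug)
    return np.mean((v1 - x) ** 2) + np.mean((v2 - x) ** 2)

def outer_loss(prompt, aug):
    """Consistency of prompted predictions across the generated views."""
    v1, v2 = views(aug)
    return (predict(prompt, v1) - predict(prompt, v2)) ** 2

# Dual-loop optimization: the inner loop adapts the augmentation via the
# self-supervised task; the outer loop tunes the prompt for cross-view consistency.
for step in range(50):
    aug = aug - 0.1 * num_grad(inner_loss, aug)
    prompt = prompt - 0.1 * num_grad(lambda p: outer_loss(p, aug), prompt)
```

After a few dozen steps the augmentation and prompt jointly drive the cross-view consistency loss toward zero; in the real method both loops would instead backpropagate through the VLM's prompted predictions.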
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | DTD | Accuracy | 48.88 | 419 |
| Image Classification | UCF101 | Top-1 Accuracy | 72.24 | 404 |
| Image Classification | Food101 | Accuracy | 87.61 | 309 |
| Image Classification | StanfordCars | Accuracy | 69.5 | 266 |
| Image Classification | SUN397 | Accuracy | 69.17 | 246 |
| Image Classification | FGVCAircraft | Accuracy | 29.05 | 225 |
| Image Classification | Caltech101 | Accuracy | 94.9 | 162 |
| Image Classification | OxfordPets | Accuracy | 92.79 | 113 |
| Image Classification | EuroSAT | Accuracy | 54.26 | 83 |
| Image Classification | Oxford Flowers | Top-1 Accuracy | 74.22 | 78 |