Dual-Modality Anchor-Guided Filtering for Test-time Prompt Tuning

About

Test-Time Prompt Tuning (TPT) adapts vision-language models using augmented views, but its effectiveness is hindered by the challenge of determining which views are beneficial. Standard entropy-based filtering relies on the internal confidence scores of the model, which are often miscalibrated under distribution shift, assigning high confidence to irrelevant crops or background regions while ignoring semantic content. To address this, we propose a dual-modality anchor-guided framework that grounds view selection in semantic evidence. We introduce a text anchor, built from attribute-rich descriptions, that provides fine-grained class semantics, and an adaptive image anchor that captures evolving test-time statistics. Using these anchors, we filter views based on alignment and confidence, ensuring that only informative views guide adaptation. Moreover, we treat the anchors as auxiliary predictive heads and combine their predictions with the original output in a confidence-weighted ensemble, yielding a stable supervision signal for prompt updates. Extensive experiments on 15 benchmark datasets demonstrate new state-of-the-art performance, highlighting the contribution of anchor-guided supervision as a foundation for robust prompt updates.
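The abstract does not give formulas, so the following is only a minimal NumPy sketch of how anchor-guided filtering and a confidence-weighted ensemble could fit together. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names (`select_views`, `confidence_weighted_ensemble`, `update_image_anchor`), the alignment score (anchor probability assigned to the model's predicted class), the entropy-based confidence, the `keep_ratio`, `temp`, and `momentum` parameters, and the moving-average update of the image anchor.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def entropy(p, axis=-1):
    return -(p * np.log(p + 1e-12)).sum(axis=axis)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def select_views(view_feats, model_logits, text_anchor, image_anchor,
                 keep_ratio=0.25, temp=0.01):
    """Rank augmented views by anchor alignment times model confidence.

    view_feats: (N, D) features of N augmented views
    model_logits: (N, C) logits from the prompt-tuned model
    text_anchor, image_anchor: (C, D) per-class anchor features (hypothetical layout)
    """
    v = l2_normalize(view_feats)
    # Treat each anchor as an auxiliary head via cosine similarity (assumed form).
    text_probs = softmax(v @ l2_normalize(text_anchor).T / temp)
    image_probs = softmax(v @ l2_normalize(image_anchor).T / temp)
    model_probs = softmax(model_logits)

    # Assumed alignment score: how much the anchors support the model's top class.
    pred = model_probs.argmax(axis=1)
    rows = np.arange(len(pred))
    alignment = 0.5 * (text_probs[rows, pred] + image_probs[rows, pred])

    # Assumed confidence score: one minus normalized entropy, in [0, 1].
    confidence = 1.0 - entropy(model_probs) / np.log(model_probs.shape[1])

    score = alignment * confidence
    k = max(1, int(round(keep_ratio * len(score))))
    keep = np.argsort(-score)[:k]  # indices of the k highest-scoring views
    return keep, (model_probs, text_probs, image_probs)

def confidence_weighted_ensemble(heads):
    """Combine (N, C) probability arrays, weighting each head by its confidence."""
    stacked = np.stack(heads)                 # (H, N, C)
    weights = softmax(-entropy(stacked), 0)   # (H, N); lower entropy gets more weight
    return (weights[..., None] * stacked).sum(axis=0)

def update_image_anchor(image_anchor, feat, cls, momentum=0.9):
    """Assumed moving-average update of one class row of the image anchor."""
    updated = image_anchor.copy()
    updated[cls] = momentum * updated[cls] + (1.0 - momentum) * feat
    return updated
```

In a usage loop, the ensemble computed on the kept views would act as the supervision target for the prompt update, and the image anchor would be refreshed after each test sample. How the paper actually defines these steps should be checked against the full text.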

Jungwon Choi, Eunwoo Kim • 2026

Related benchmarks

Task	Dataset	Result
Classification	Cars	Accuracy 69.55	571
Image Classification	DTD	Accuracy 53.72	487
Image Classification	Aircraft	Accuracy 29.11	340
Image Classification	Pets	Accuracy 90.32	320
Image Classification	Food101	Accuracy 87.41	177
Image Classification	UCF101	Accuracy 72.24	64
Image Classification	Caltech101	Accuracy 94.85	40
Image Classification	SUN397	Accuracy 70.32	28
Image Classification	ImageNet and OOD variants 1.0 (test)	ImageNet-A Accuracy 62.4	18
Image Classification	EuroSAT	Top-1 Accuracy 51.69	10
