C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion
About
In deep learning, test-time adaptation has gained attention as a method for fine-tuning models without the need for labeled data. A prime example is the recently proposed test-time prompt tuning for large-scale vision-language models such as CLIP. Unfortunately, these prompts have been developed mainly to improve accuracy, overlooking the importance of calibration, which is crucial for quantifying prediction uncertainty. Traditional calibration methods, however, rely on substantial amounts of labeled data, making them impractical for test-time scenarios. Motivated by this, this paper explores calibration during test-time prompt tuning by leveraging the inherent properties of CLIP. Through a series of observations, we find that the choice of prompt significantly affects calibration in CLIP: prompts leading to higher text feature dispersion result in better-calibrated predictions. Introducing the Average Text Feature Dispersion (ATFD), we establish its relationship with calibration error and present a novel method, Calibrated Test-time Prompt Tuning (C-TPT), for optimizing prompts during test time with enhanced calibration. Through extensive experiments on different CLIP architectures and datasets, we show that C-TPT can effectively improve the calibration of test-time prompt tuning without needing labeled data. The code is publicly accessible at https://github.com/hee-suk-yoon/C-TPT.
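As a rough illustration of the dispersion quantity described above, the sketch below computes the average distance of each class's text embedding from the centroid of all class embeddings. This is an assumption-laden reading of ATFD from the abstract alone (using NumPy instead of the repository's PyTorch code), not the authors' reference implementation; see the linked repository for the actual method.

```python
import numpy as np

def atfd(text_features: np.ndarray) -> float:
    """Sketch of Average Text Feature Dispersion (ATFD).

    Interprets ATFD as the mean L2 distance of each class text
    embedding from the centroid of all class embeddings; this
    reading is inferred from the abstract, not copied from the
    official C-TPT code.
    """
    # L2-normalize each class embedding, as CLIP does before matching
    feats = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)
    centroid = feats.mean(axis=0)
    # Higher values = text features spread farther apart
    return float(np.mean(np.linalg.norm(feats - centroid, axis=1)))

# Toy example: 4 hypothetical class embeddings in 8 dimensions
rng = np.random.default_rng(0)
dispersion = atfd(rng.normal(size=(4, 8)))
```

In the paper's framing, a prompt inducing a larger value of this quantity would be expected to yield better-calibrated predictions, so C-TPT optimizes the prompt to increase dispersion alongside the usual test-time objective.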
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | ImageNet-A | Top-1 Acc: 51.6 | 654 |
| Image Classification | Stanford Cars | Accuracy: 77.5 | 635 |
| Image Classification | ImageNet V2 | -- | 611 |
| Image Classification | Flowers102 | Accuracy: 76.5 | 558 |
| Image Classification | Food-101 | Accuracy: 88.9 | 542 |
| Image Classification | ImageNet-R | Top-1 Acc: 76 | 529 |
| Image Classification | DTD | Accuracy: 46 | 485 |
| Image Classification | Food101 | -- | 457 |
| Fine-Grained Visual Classification | FGVC-Aircraft (test) | Top-1 Acc: 24 | 312 |
| Fine-Grained Image Classification | Stanford Cars | Accuracy: 65.8 | 284 |