Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining

About

Test-time prompt tuning (TPT) has emerged as a promising technique for enhancing the adaptability of vision-language models by optimizing textual prompts using unlabeled test data. However, prior studies have observed that TPT often produces poorly calibrated models, raising concerns about the reliability of their predictions. Recent works address this issue by incorporating additional regularization terms that constrain model outputs, which improve calibration but often degrade performance. In this work, we reveal that these regularization strategies implicitly encourage optimization toward flatter minima, and that the sharpness of the loss landscape around adapted prompts is a key factor governing calibration quality. Motivated by this observation, we introduce Flatness-aware Prompt Pretraining (FPP), a simple yet effective pretraining framework for TPT that initializes prompts within flatter regions of the loss landscape prior to adaptation. We show that simply replacing the initialization in existing TPT pipelines--without modifying any other components--is sufficient to improve both calibration and performance. Notably, FPP requires no labeled data and incurs no additional computational costs during test-time tuning, making it highly practical for real-world deployment. The code is available at: https://github.com/YonseiML/fpp.
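The abstract refers to calibration without naming a metric. As a rough illustration only, the sketch below computes Expected Calibration Error (ECE), a commonly used calibration measure. It is an assumption that this metric is relevant here. The function name, the equal-width binning, and the default of 10 bins are illustrative choices, not details taken from the paper or its repository.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Illustrative ECE: bin-weighted mean of |accuracy - mean confidence|.

    confidences: predicted max-class probabilities in [0, 1]
    correct:     1 if the prediction was right, 0 otherwise
    n_bins:      number of equal-width confidence bins (assumed default)
    """
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Bin i covers [i/n_bins, (i+1)/n_bins); confidence 1.0 is
        # clamped into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's gap by the fraction of samples it holds.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical toy data: four predictions at 0.9 confidence, three correct.
# Accuracy is 0.75 against a mean confidence of 0.9, so the ECE is
# approximately 0.15.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
print(round(toy_ece, 4))
```

A lower value means confidence tracks accuracy more closely. A model that is 90% confident but only 75% accurate, as in the toy data, is overconfident.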

Hyeonseo Jang, Jaebyeong Jeon, Joong-Won Hwang, Kibok Lee • 2026

Related benchmarks

Task	Dataset	Result
Image Classification	ImageNet V2	--	767
Image Classification	ImageNet A	Top-1 Acc 52.3	723
Image Classification	Food-101	Accuracy 85.17	590
Image Classification	SUN397	Accuracy 68.29	116
Image Classification	FGVC Aircraft	Accuracy 24.06	59
Fine-grained Image Classification	FGVC Aircraft	--	50
Image Classification	Oxford-IIIT Pets	Accuracy 91.55	33
Image Classification	Oxford Flowers 102	Accuracy 69.35	32
Image Classification	ImageNet-R	Accuracy 76.7	29
Fine-grained Image Classification	FGVC-Aircraft (test)	Accuracy 0.2475	23

Showing 10 of 39 rows
