SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning

About

With the increasing adoption of vision-language models (VLMs) in critical decision-making systems such as healthcare or autonomous driving, the calibration of their uncertainty estimates becomes paramount. Yet this dimension has been largely underexplored in the VLM test-time prompt-tuning (TPT) literature, which has predominantly focused on improving discriminative performance. Recent state-of-the-art work advocates enforcing full orthogonality over pairs of text prompt embeddings to enhance separability and, in turn, calibration. Nevertheless, as we theoretically show in this work, the gradients inherent to fully orthogonal constraints strongly push semantically related classes apart, ultimately making the model overconfident. Based on these findings, we propose Semantic Orthogonal Calibration (SoC), a Huber-based regularizer that enforces smooth prototype separation while preserving semantic proximity, thereby improving calibration compared to prior orthogonality-based approaches. Through comprehensive empirical validation, we demonstrate that SoC consistently improves calibration while maintaining competitive discriminative performance.
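The abstract contrasts a full-orthogonality constraint with a Huber-based penalty. The sketch below is only an illustration of the general property the argument relies on, not the SoC implementation: the function names, the threshold `delta`, and the choice to apply both penalties to pairwise cosine similarities of class text embeddings are all assumptions. With a squared penalty, the gradient on a pair grows linearly with its similarity, so the most similar (often semantically related) classes are pushed apart hardest; the Huber penalty matches the squared penalty for small similarities but caps the gradient at `delta`.

```python
import math

# Hypothetical illustration, not the authors' SoC code: penalties applied to
# pairwise cosine similarities between class text (prompt) embeddings.

def cosine_matrix(embs):
    """Return the K x K cosine-similarity matrix of a list of vectors."""
    normed = []
    for v in embs:
        norm = math.sqrt(sum(x * x for x in v))
        normed.append([x / norm for x in v])
    return [[sum(a * b for a, b in zip(u, w)) for w in normed] for u in normed]

def quadratic(s):
    # Squared penalty on one off-diagonal similarity; one common way to write
    # a full-orthogonality objective, ||S - I||_F^2, up to a constant factor.
    return 0.5 * s * s

def huber(s, delta):
    # Quadratic for |s| <= delta, linear beyond it.
    a = abs(s)
    return 0.5 * s * s if a <= delta else delta * (a - 0.5 * delta)

def grad_quadratic(s):
    # Gradient keeps growing as the similarity grows.
    return s

def grad_huber(s, delta):
    # Gradient is capped at +/- delta for highly similar pairs.
    return s if abs(s) <= delta else math.copysign(delta, s)

def _mean_offdiag(embs, penalty):
    # Average a per-pair penalty over all ordered pairs i != j (needs K >= 2).
    sims = cosine_matrix(embs)
    k = len(sims)
    vals = [penalty(sims[i][j]) for i in range(k) for j in range(k) if i != j]
    return sum(vals) / len(vals)

def quadratic_orthogonality_penalty(embs):
    return _mean_offdiag(embs, quadratic)

def huber_orthogonality_penalty(embs, delta=0.2):
    return _mean_offdiag(embs, lambda s: huber(s, delta))

if __name__ == "__main__":
    # A pair with cosine similarity 0.9, e.g. two closely related classes.
    print(grad_quadratic(0.9))   # 0.9
    print(grad_huber(0.9, 0.2))  # 0.2
```

Both penalties agree when |s| <= `delta`; they diverge only for pairs that are already close, which is exactly where the abstract argues that full orthogonality pushes related classes apart and produces overconfidence.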

Leo Fillioux, Omprakash Chakraborty, Ismail Ben Ayed, Paul-Henry Cournède, Stergios Christodoulidis, Maria Vakalopoulou, Jose Dolz • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Image Classification	Stanford Cars	Accuracy	77	705
Image Classification	Food-101	Accuracy	88.9	590
Image Classification	Flowers102	Accuracy	77	558
Image Classification	Food101	--	--	457
Image Classification	FGVC Aircraft	Accuracy	30.9	59
Image Classification	ImageNet V2 1.0 (test)	Top-1 Accuracy	68.9	54
Image Classification	ImageNet	Accuracy	74.5	45
Image Classification	Oxford-IIIT Pets	Accuracy	93.9	33
Image Classification	UCF-101	Accuracy	74.9	30
Image Classification	EuroSAT	Accuracy	58.3	26

