Mint: A Simple Test-Time Adaptation of Vision-Language Models against Common Corruptions

About

Pretrained vision-language models such as CLIP achieve strong zero-shot generalization but remain vulnerable to distribution shifts caused by input corruptions. In this work, we investigate how corruptions affect CLIP's image embeddings and uncover a consistent phenomenon we term embedding variance collapse, where both intra-class and inter-class variances shrink as corruption severity increases. We find that this collapse is closely tied to performance degradation, with inter-class variance strongly correlated with classification accuracy. To explain this phenomenon, we analyze how corruptions alter the structure of the embedding space. Our theoretical results suggest that the visual encoder tends to encode corruption-related signals, which dilute class-discriminative features and compress the representation geometry. We further show that maximizing inter-class variance, even when estimated from pseudo-labels, can provably enhance embedding quality. Based on this insight, we propose Mint, a simple test-time adaptation method that maximizes pseudo-label-based inter-class variance on the fly using a mean accumulator and a gradient accumulator. Mint operates effectively with small batch sizes and consistently improves performance across multiple corruption benchmarks and CLIP architectures. Our code is available at https://github.com/baowenxuan/Mint.

Wenxuan Bao, Ruxi Deng, Jingrui He • 2025
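
The quantity at the center of the abstract, inter-class variance estimated from pseudo-labels, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the pseudo-labels are taken as the nearest text embedding by cosine similarity, and inter-class variance is taken as the size-weighted mean squared distance of class means from the global mean. The paper's exact definition may differ, and the sketch does not reproduce Mint's mean accumulator, gradient accumulator, or update rule.

```python
import math

def normalize(v):
    # Scale a vector to unit L2 norm.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def pseudo_labels(image_embs, text_embs):
    # Assumed pseudo-labeling: pick the class whose text embedding
    # has the highest cosine similarity with the image embedding.
    texts = [normalize(t) for t in text_embs]
    labels = []
    for emb in image_embs:
        e = normalize(emb)
        sims = [sum(a * b for a, b in zip(e, t)) for t in texts]
        labels.append(sims.index(max(sims)))
    return labels

def mean(vectors):
    # Coordinate-wise mean of a list of equal-length vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def inter_class_variance(image_embs, labels):
    # Assumed definition: size-weighted mean squared distance
    # between each (pseudo-)class mean and the global mean.
    global_mean = mean(image_embs)
    total = 0.0
    for c in set(labels):
        members = [e for e, y in zip(image_embs, labels) if y == c]
        class_mean = mean(members)
        sq_dist = sum((a - b) ** 2 for a, b in zip(class_mean, global_mean))
        total += len(members) * sq_dist
    return total / len(image_embs)

# Toy 2-D data (made up): two classes, well separated vs. squeezed together.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
clean = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.9]]
collapsed = [[0.6, 0.5], [0.55, 0.45], [0.5, 0.6], [0.45, 0.55]]

var_clean = inter_class_variance(clean, pseudo_labels(clean, text_embs))
var_collapsed = inter_class_variance(collapsed, pseudo_labels(collapsed, text_embs))
```

On this toy data the pseudo-labels are the same for both sets, but the squeezed embeddings yield a much smaller inter-class variance, which is the kind of shrinkage the abstract calls embedding variance collapse.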

Related benchmarks

Task	Dataset	Result
Image Classification	DomainNet (test)	Average Accuracy 56.1	266
Image Classification	DomainNet	Accuracy (ClipArt) 69.3	238
Image Classification	CIFAR-10C Severity Level 5 (test)	Average Error Rate (Severity 5) 70.2	136
Image Classification	ImageNet-C Severity 5 (test)	Mean Error Rate (Severity 5) 29.18	132
Image Classification	CIFAR-100-C	Accuracy (Corruption) 53.8	109
Image Classification	CIFAR-100-C v1 (test)	Error Rate (Average) 37.09	60
Image Classification	CIFAR-100C Level 5 (test)	Mean Accuracy (C5) 42.41	56
Image Classification	ImageNet-C	Accuracy (Brightness) 67.7	54
Image Classification	ImageNet-C 1.0 (test)	--	53
Image Classification	CIFAR10-C	Mean Accuracy (mAcc) 80.59	41

Showing 10 of 16 rows
