WATT: Weight Average Test-Time Adaptation of CLIP

About

Vision-Language Models (VLMs) such as CLIP have yielded unprecedented performance for zero-shot image classification, yet their generalization capability may still be seriously challenged when confronted with domain shifts. In response, we present Weight Average Test-Time Adaptation (WATT) of CLIP, a pioneering approach facilitating full test-time adaptation (TTA) of this VLM. Our method employs a diverse set of templates for text prompts, augmenting the existing framework of CLIP. Predictions are utilized as pseudo labels for model updates, followed by weight averaging to consolidate the learned information globally. Furthermore, we introduce a text ensemble strategy, enhancing overall test performance by aggregating diverse textual cues. Our findings underscore the efficacy of WATT in enhancing performance across diverse datasets, including CIFAR-10-C, CIFAR-10.1, CIFAR-100-C, VisDA-C, and several other challenging datasets, effectively covering a wide range of domain shifts. Notably, these enhancements are achieved without necessitating additional model transformations or trainable modules. Moreover, compared to other Test-Time Adaptation methods, our approach can operate effectively with just a single image. Highlighting the potential of innovative test-time strategies, this research emphasizes their role in fortifying the adaptability of VLMs. The implementation is available at: https://github.com/Mehrdad-Noori/WATT.git
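The two aggregation steps named in the abstract, weight averaging and text ensembling, can be illustrated with a small sketch. This is not the authors' implementation: the function names `average_weights` and `text_ensemble` are hypothetical, parameters are represented as plain Python lists instead of model tensors, and averaging unit-normalized per-template text embeddings is only one common way to build a text ensemble, assumed here for illustration.

```python
import math

def average_weights(states):
    """Element-wise mean of several parameter dicts (name -> list of floats).

    Assumption: each dict holds the weights of one model copy, e.g. one copy
    adapted per text template. This is an illustrative stand-in, not WATT's code.
    """
    n = len(states)
    return {
        name: [sum(s[name][i] for s in states) / n
               for i in range(len(states[0][name]))]
        for name in states[0]
    }

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def text_ensemble(per_template_embeddings):
    """Average unit-normalized embeddings of one class over templates,
    then renormalize (an assumed ensembling rule)."""
    normalized = [l2_normalize(e) for e in per_template_embeddings]
    dim = len(normalized[0])
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)

# Toy example: two "adapted" weight sets and two template embeddings.
states = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
averaged = average_weights(states)
class_embedding = text_ensemble([[1.0, 0.0], [0.0, 1.0]])
```

In this toy setting, `averaged["w"]` is the midpoint of the two weight vectors and `class_embedding` is a unit vector pointing between the two template embeddings. In an actual CLIP setup, the dicts would be model state dictionaries and the embeddings would come from the text encoder.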

David Osowiechi, Mehrdad Noori, Gustavo Adolfo Vargas Hakim, Moslem Yazdanpanah, Ali Bahri, Milad Cheraghalikhani, Sahar Dastani, Farzad Beizaee, Ismail Ben Ayed, Christian Desrosiers • 2024

Related benchmarks

Task	Dataset	Metric	Value
Image Classification	CIFAR-100	Accuracy	70.74	691
Image Classification	CIFAR-10	Accuracy	91.41	564
Image Classification	PACS (test)	Average Accuracy	97.51	279
Image Classification	PACS	Overall Average Accuracy	97.51	270
Image Classification	DomainNet (test)	Average Accuracy	55.5	266
Image Classification	DomainNet	Accuracy (ClipArt)	68.3	238
Image Classification	CIFAR-10-C	Accuracy	80.06	179
Image Classification	OfficeHome	Average Accuracy	86.45	161
Multi-class classification	VLCS	Acc (Caltech)	99.51	139
Image Classification	CIFAR-10C Severity Level 5 (test)	Average Error Rate (Severity 5)	66.57	136

Showing 10 of 33 rows
