Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification

About

Vision-Language Models (VLMs) such as CLIP are trained on large numbers of image-text pairs, resulting in remarkable generalization across several data distributions. However, in many cases, the end application does not justify their expensive training and data collection/curation costs. This motivates a vendor-client paradigm, where a vendor trains a large-scale VLM and grants only input-output access to clients on a pay-per-query basis in a black-box setting. The client aims to minimize inference cost by distilling the VLM to a student model using the limited available task-specific data, and then deploying this student model in the downstream application. While naive distillation largely improves the In-Domain (ID) accuracy of the student, it fails to transfer the superior out-of-distribution (OOD) generalization of the VLM teacher using the limited available labeled images. To mitigate this, we propose Vision-Language to Vision - Align, Distill, Predict (VL2V-ADiP), which first aligns the vision and language modalities of the teacher model with the vision modality of a pre-trained student model, and then distills the aligned VLM representations to the student. This maximally retains the pre-trained features of the student, while also incorporating the rich representations of the VLM image encoder and the superior generalization of the text embeddings. The proposed approach achieves state-of-the-art results on the standard Domain Generalization benchmarks in a black-box teacher setting as well as a white-box setting where the weights of the VLM are accessible.
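
The align-then-predict idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the cosine-similarity form of the loss, the function names (`alignment_loss`, `predict`), and the assumption that student features already share the teacher's embedding dimension are all hypothetical choices made for illustration. The teacher embeddings stand in for outputs obtained through black-box queries.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def alignment_loss(student_feats, teacher_img_emb, teacher_txt_emb):
    """Hypothetical objective: pull each student feature toward both the
    teacher's image embedding and the teacher's text embedding of the
    ground-truth class, measured by cosine similarity.

    All inputs have shape (batch, dim). The loss is 0 when the student
    feature is perfectly aligned with both teacher embeddings.
    """
    s = l2_normalize(student_feats)
    cos_img = np.sum(s * l2_normalize(teacher_img_emb), axis=-1)
    cos_txt = np.sum(s * l2_normalize(teacher_txt_emb), axis=-1)
    return float(np.mean(2.0 - cos_img - cos_txt))


def predict(student_feats, class_txt_emb):
    """Classify by picking the class whose text embedding is most similar
    (by cosine similarity) to the student feature.

    student_feats: (batch, dim); class_txt_emb: (num_classes, dim).
    """
    logits = l2_normalize(student_feats) @ l2_normalize(class_txt_emb).T
    return np.argmax(logits, axis=-1)
```

In this sketch, minimizing `alignment_loss` over the student's parameters would correspond loosely to the align and distill stages, and `predict` to classifying with the student's features against class text embeddings. How the actual method schedules these stages and which parameters it updates in each is not specified here.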

Sravanti Addepalli, Ashish Ramayee Asokan, Lakshay Sharma, R. Venkatesh Babu • 2023

Related benchmarks

Task	Dataset	Result
Domain Generalization	VLCS	Accuracy 81.9	270
Domain Generalization	DomainBed v1.0 (test)	Average Accuracy 77.73	71
Image Classification	PACS, VLCS, OfficeHome, TerraIncognita, DomainNet out-of-domain	PACS Accuracy 96.7	42
Domain Generalization	DomainBed (OH, TI, VLCS, PACS, DN) (test)	Accuracy (OH) 87.38	33
Domain Generalization	PACS OOD (test)	Average Accuracy 94.94	31
Domain Generalization	PACS, VLCS, OfficeHome, TerraIncognita, DomainNet	PACS Accuracy 96.7	27
Domain Generalization	DomainNet (out-of-domain)	Accuracy 59.38	25
Domain Generalization	OfficeHome DomainBed (OOD)	Avg OOD Accuracy 85.74	16
Image Classification	Camelyon17-WILDS (test)	--	16
Breast cancer metastases classification	Camelyon17-WILDS (test)	Center 1 Accuracy 96.32	8

Showing 10 of 12 rows
