Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification
About
Vision-Language Models (VLMs) such as CLIP are trained on large amounts of image-text pairs, resulting in remarkable generalization across several data distributions. However, in several cases, the cost of training them and of collecting and curating their data is not justified by the end application. This motivates a vendor-client paradigm, where a vendor trains a large-scale VLM and grants clients only input-output access on a pay-per-query basis in a black-box setting. The client aims to minimize inference cost by distilling the VLM into a student model using the limited available task-specific data, and then deploying this student model in the downstream application. While naive distillation substantially improves the in-domain (ID) accuracy of the student, it fails to transfer the superior out-of-distribution (OOD) generalization of the VLM teacher using the limited available labeled images. To mitigate this, we propose Vision-Language to Vision - Align, Distill, Predict (VL2V-ADiP), which first aligns the vision and language modalities of the teacher model with the vision modality of a pre-trained student model, and then distills the aligned VLM representations to the student. This maximally retains the pre-trained features of the student, while also incorporating the rich representations of the VLM image encoder and the superior generalization of the text embeddings. The proposed approach achieves state-of-the-art results on the standard Domain Generalization benchmarks in a black-box teacher setting as well as in a white-box setting where the weights of the VLM are accessible.
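
The align-and-distill idea can be illustrated with a minimal sketch. The abstract does not state the loss or the prediction rule, so everything below is an assumption made for illustration: the student features are pulled toward both the teacher's image embedding and the text embedding of the ground-truth class using a cosine-distance loss, and prediction picks the class whose text embedding is most similar to the student feature. The function names (`alignment_loss`, `predict`) and the toy vectors are hypothetical and not taken from the paper. In a black-box setting, the teacher embeddings would be the outputs returned by the vendor's query interface.

```python
import math

def normalize(v):
    # Scale a vector to unit length (left unchanged if it is all zeros).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine(u, v):
    # Cosine similarity between two vectors of equal length.
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

def alignment_loss(student, teacher_img, teacher_txt):
    # ASSUMED loss: per sample, (1 - cos) to the teacher image embedding
    # plus (1 - cos) to the text embedding of the sample's class label,
    # averaged over the batch. Zero when all three are perfectly aligned.
    total = 0.0
    for s, ti, tt in zip(student, teacher_img, teacher_txt):
        total += (1.0 - cosine(s, ti)) + (1.0 - cosine(s, tt))
    return total / len(student)

def predict(student, class_txt):
    # ASSUMED prediction rule: choose the class whose text embedding has
    # the highest cosine similarity with the student feature.
    return [
        max(range(len(class_txt)), key=lambda c: cosine(s, class_txt[c]))
        for s in student
    ]

# Toy data standing in for real embeddings (hypothetical values).
class_txt = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
labels = [0, 2]
student = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.8]]
teacher_img = [[1.0, 0.2, 0.0], [0.0, 0.1, 1.0]]
teacher_txt = [class_txt[y] for y in labels]

loss = alignment_loss(student, teacher_img, teacher_txt)
predictions = predict(student, class_txt)
```

In an actual training loop, a loss of this kind would be minimized with respect to the student's parameters (for example, first only a projection head, then the full backbone); the sketch only evaluates it on fixed vectors.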
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Domain Generalization | VLCS | Accuracy | 81.9 | 238 |
| Domain Generalization | DomainBed v1.0 (test) | Average Accuracy | 77.73 | 71 |
| Domain Generalization | DomainBed (OH, TI, VLCS, PACS, DN) (test) | Accuracy (OH) | 87.38 | 33 |
| Domain Generalization | PACS, VLCS, OfficeHome, TerraIncognita, DomainNet | PACS Accuracy | 96.7 | 27 |
| Domain Generalization | DomainNet (out-of-domain) | Accuracy | 59.38 | 25 |
| Domain Generalization | OfficeHome DomainBed (OOD) | Avg OOD Accuracy | 85.74 | 16 |
| Image Classification | Camelyon17-WILDS (test) | -- | -- | 16 |
| Domain Generalization | PACS OOD (test) | Average Accuracy | 94.94 | 13 |
| Breast cancer metastases classification | Camelyon17-WILDS (test) | Center 1 Accuracy | 96.32 | 8 |
| Histopathology Image Classification | Kather19 (test) | Accuracy (ACC) | 92.08 | 6 |