
Boosting Vision-Language Models with Transduction

About

Transduction is a powerful paradigm that leverages the structure of unlabeled data to boost predictive accuracy. We present TransCLIP, a novel and computationally efficient transductive approach designed for Vision-Language Models (VLMs). TransCLIP is applicable as a plug-and-play module on top of popular inductive zero- and few-shot models, consistently improving their performance. Our new objective function can be viewed as a regularized maximum-likelihood estimation, constrained by a KL divergence penalty that integrates the text-encoder knowledge and guides the transductive learning process. We further derive an iterative Block Majorize-Minimize (BMM) procedure for optimizing our objective, with guaranteed convergence and decoupled sample-assignment updates, yielding computationally efficient transduction for large-scale datasets. We report comprehensive evaluations, comparisons, and ablation studies that demonstrate: (i) transduction can greatly enhance the generalization capabilities of inductive pretrained zero- and few-shot VLMs; (ii) TransCLIP substantially outperforms standard transductive few-shot learning methods relying solely on vision features, notably due to the KL-based language constraint.
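To make the idea concrete, below is a minimal, illustrative sketch (not the authors' implementation) of a transductive soft-assignment loop of the kind described above: class scores from vision features are combined with a KL-style pull toward the zero-shot text predictions, and assignments and class prototypes are updated in alternation. All names (`transductive_assign`, `lam`, etc.) and the exact update forms are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def transductive_assign(image_feats, text_feats, lam=1.0, n_iter=10):
    """Toy transductive assignment sketch, loosely mirroring the
    decoupled sample-assignment updates described above.

    image_feats: (N, D) L2-normalized image embeddings
    text_feats:  (K, D) L2-normalized class text embeddings
    lam:         strength of the pull toward the text-based predictions
    """
    # Zero-shot text-driven probabilities serve as the anchor distribution.
    y_hat = softmax(image_feats @ text_feats.T, axis=1)  # (N, K)
    q = y_hat.copy()                                     # initial soft assignments
    for _ in range(n_iter):
        # Class prototypes from current soft assignments (M-step-like).
        protos = (q.T @ image_feats) / q.sum(axis=0, keepdims=True).T
        protos /= np.linalg.norm(protos, axis=1, keepdims=True)
        # Assignment update: vision similarity plus a KL-like pull
        # toward the zero-shot text predictions y_hat.
        scores = image_feats @ protos.T + lam * np.log(y_hat + 1e-12)
        q = softmax(scores, axis=1)
    return q
```

In this toy form, setting `lam=0` recovers a purely vision-based transductive clustering, while larger `lam` keeps the assignments close to the zero-shot predictions, which is the role the KL-based language constraint plays in the objective above.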

Maxime Zanella, Benoît Gérin, Ismail Ben Ayed • 2024

Related benchmarks

Task                 | Dataset            | Metric         | Result | Rank
Image Classification | ImageNet 1k (test) | Top-1 Accuracy | 77.8   | 798
Image Classification | EuroSAT            | Accuracy       | 83     | 497
Image Classification | DTD                | Accuracy       | 65.1   | 487
Image Classification | DTD                | Accuracy       | 15.3   | 419
Image Classification | UCF101             | Top-1 Accuracy | 82.1   | 404
Classification       | Cars               | Accuracy       | 79.8   | 314
Image Classification | Aircraft           | Accuracy       | 38.6   | 302
Image Classification | Pets               | Accuracy       | 93.8   | 204
Image Classification | ImageNet           | Accuracy       | 71.8   | 184
Image Classification | Caltech            | Accuracy       | 94     | 98

Showing 10 of 46 rows

Other info

Code
