
Boosting Vision-Language Models with Transduction

About

Transduction is a powerful paradigm that leverages the structure of unlabeled data to boost predictive accuracy. We present TransCLIP, a novel and computationally efficient transductive approach designed for Vision-Language Models (VLMs). TransCLIP is applicable as a plug-and-play module on top of popular inductive zero- and few-shot models, consistently improving their performance. Our new objective function can be viewed as a regularized maximum-likelihood estimation, constrained by a KL divergence penalty that integrates the text-encoder knowledge and guides the transductive learning process. We further derive an iterative Block Majorize-Minimize (BMM) procedure for optimizing our objective, with guaranteed convergence and decoupled sample-assignment updates, yielding computationally efficient transduction for large-scale datasets. We report comprehensive evaluations, comparisons, and ablation studies that demonstrate: (i) transduction can greatly enhance the generalization capabilities of inductive pretrained zero- and few-shot VLMs; (ii) TransCLIP substantially outperforms standard transductive few-shot learning methods relying solely on vision features, notably thanks to the KL-based language constraint.
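To make the idea concrete, here is a minimal sketch of KL-regularized transductive assignment in NumPy. This is an illustrative simplification, not the authors' implementation: the function name, the CLIP-style logit scale of 100, and the `text_weight` parameter are assumptions. Soft assignments are refined from visual class prototypes while a KL-style term pulls them toward the frozen text-encoder predictions.

```python
import numpy as np

def transductive_assignments(image_feats, text_feats, text_weight=1.0, n_iters=10):
    """Hypothetical sketch: refine soft label assignments over an unlabeled
    batch, anchored to text-encoder predictions by a KL-style penalty.

    image_feats: (N, D) L2-normalized image embeddings
    text_feats:  (K, D) L2-normalized class text embeddings
    Returns z:   (N, K) soft class assignments (rows sum to 1)
    """
    # Zero-shot text-driven predictions: these act as the KL anchor.
    logits = 100.0 * image_feats @ text_feats.T
    text_probs = np.exp(logits - logits.max(1, keepdims=True))
    text_probs /= text_probs.sum(1, keepdims=True)

    z = text_probs.copy()  # initialize assignments from the zero-shot prior
    for _ in range(n_iters):
        # Class prototypes from current soft assignments (M-step-like update).
        protos = (z.T @ image_feats) / (z.sum(0)[:, None] + 1e-12)
        protos /= np.linalg.norm(protos, axis=1, keepdims=True)
        # Visual affinity combined with the text prior; adding the log-prior
        # corresponds to a KL pull toward the text-encoder distribution.
        vis_logits = 100.0 * image_feats @ protos.T
        combined = vis_logits + text_weight * np.log(text_probs + 1e-12)
        z = np.exp(combined - combined.max(1, keepdims=True))
        z /= z.sum(1, keepdims=True)
    return z
```

Because each sample's assignment update depends only on its own row, the updates decouple across samples, which is what makes this style of transduction tractable on large datasets.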

Maxime Zanella, Benoît Gérin, Ismail Ben Ayed • 2024

Related benchmarks

Task                               | Dataset              | Metric          | Result | Rank
-----------------------------------|----------------------|-----------------|--------|-----
Image Classification               | ImageNet 1k (test)   | Top-1 Accuracy  | 77.8   | 848
Image Classification               | EuroSAT              | Accuracy        | 83     | 569
Image Classification               | DTD                  | Accuracy        | 65.1   | 542
Image Classification               | DTD                  | Accuracy        | 15.3   | 485
Image Classification               | UCF101               | Top-1 Acc       | 82.1   | 455
Classification                     | Cars                 | Accuracy        | 79.8   | 395
Image Classification               | RESISC45             | Accuracy        | 79.5   | 349
Image Classification               | Aircraft             | Accuracy        | 38.6   | 333
Fine-grained visual classification | FGVC-Aircraft (test) | Top-1 Acc       | 26.9   | 312
Image Classification               | Pets                 | Accuracy        | 93.8   | 245

Showing 10 of 80 rows.

Other info

Code
