Boosting Vision-Language Models with Transduction

About

Transduction is a powerful paradigm that leverages the structure of unlabeled data to boost predictive accuracy. We present TransCLIP, a novel and computationally efficient transductive approach designed for Vision-Language Models (VLMs). TransCLIP is applicable as a plug-and-play module on top of popular inductive zero- and few-shot models, consistently improving their performances. Our new objective function can be viewed as a regularized maximum-likelihood estimation, constrained by a KL divergence penalty that integrates the text-encoder knowledge and guides the transductive learning process. We further derive an iterative Block Majorize-Minimize (BMM) procedure for optimizing our objective, with guaranteed convergence and decoupled sample-assignment updates, yielding computationally efficient transduction for large-scale datasets. We report comprehensive evaluations, comparisons, and ablation studies that demonstrate: (i) Transduction can greatly enhance the generalization capabilities of inductive pretrained zero- and few-shot VLMs; (ii) TransCLIP substantially outperforms standard transductive few-shot learning methods relying solely on vision features, notably due to the KL-based language constraint.

Maxime Zanella, Beno\^it G\'erin, Ismail Ben Ayed• 2024

Related benchmarks

Task	Dataset	Result
Image Classification	ImageNet 1k (test)	Top-1 Accuracy77.8	880
Image Classification	DTD	Accuracy65.1	599
Image Classification	EuroSAT	Accuracy83	569
Image Classification	UCF101	Top-1 Acc82.1	527
Classification	Cars	Accuracy79.8	492
Image Classification	DTD	Accuracy15.3	487
Image Classification	RESISC45	Accuracy79.5	472
Image Classification	Aircraft	Accuracy38.6	340
Fine-grained visual classification	FGVC-Aircraft (test)	Top-1 Acc26.9	312
Image Classification	Pets	Accuracy93.8	308

Showing 10 of 80 rows

...

Other info

Code

Follow for update

@wizwand_team Discord