Boosting Vision-Language Models with Transduction
About
Transduction is a powerful paradigm that leverages the structure of unlabeled data to boost predictive accuracy. We present TransCLIP, a novel and computationally efficient transductive approach designed for Vision-Language Models (VLMs). TransCLIP is applicable as a plug-and-play module on top of popular inductive zero- and few-shot models, consistently improving their performance. Our new objective function can be viewed as a regularized maximum-likelihood estimation, constrained by a KL divergence penalty that integrates the text-encoder knowledge and guides the transductive learning process. We further derive an iterative Block Majorize-Minimize (BMM) procedure for optimizing our objective, with guaranteed convergence and decoupled sample-assignment updates, yielding computationally efficient transduction for large-scale datasets. We report comprehensive evaluations, comparisons, and ablation studies that demonstrate: (i) transduction can greatly enhance the generalization capabilities of inductive pretrained zero- and few-shot VLMs; (ii) TransCLIP substantially outperforms standard transductive few-shot learning methods relying solely on vision features, notably due to the KL-based language constraint.
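To make the idea concrete, here is a toy NumPy sketch of transductive assignment refinement in this spirit: soft labels for all unlabeled samples are updated iteratively, each sample's update is decoupled from the others, and the zero-shot text predictions enter as a log-prior term (playing the role of the KL-based language constraint). The function name, the `lam`/`temp` hyperparameters, and the simple prototype update are all illustrative assumptions, not the official TransCLIP BMM procedure.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with max-subtraction for numerical stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def transductive_sketch(z, t, lam=1.0, temp=100.0, n_iter=10):
    """Toy transductive refinement sketch (NOT the official TransCLIP code).

    z: (N, D) L2-normalized image features of the unlabeled query set
    t: (K, D) L2-normalized class text embeddings
    lam: weight of the text-prior term (stand-in for the KL penalty)
    Returns (N, K) soft class assignments.
    """
    # Zero-shot text-driven predictions: the anchor the KL penalty pulls toward.
    y_text = softmax(temp * (z @ t.T))
    # Initialize visual class prototypes from the text embeddings.
    mu = t.copy()
    q = y_text.copy()
    for _ in range(n_iter):
        # Assignment step: each sample is updated independently (decoupled),
        # blending visual similarity to prototypes with the text prior.
        logits = temp * (z @ mu.T)
        q = softmax(logits + lam * np.log(y_text + 1e-12))
        # Prototype step: assignment-weighted mean of features, re-normalized,
        # so the structure of the unlabeled batch reshapes the classifier.
        mu = q.T @ z
        mu /= np.linalg.norm(mu, axis=1, keepdims=True) + 1e-12
    return q
```

In this sketch the per-sample update has a closed form (a softmax), which is what makes transduction cheap even on large query sets; the actual method additionally proves convergence of its block-wise updates.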
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet 1k (test) | Top-1 Accuracy | 77.8 | 848 |
| Image Classification | EuroSAT | Accuracy | 83 | 569 |
| Image Classification | DTD | Accuracy | 65.1 | 542 |
| Image Classification | DTD | Accuracy | 15.3 | 485 |
| Image Classification | UCF101 | Top-1 Accuracy | 82.1 | 455 |
| Classification | Cars | Accuracy | 79.8 | 395 |
| Image Classification | RESISC45 | Accuracy | 79.5 | 349 |
| Image Classification | Aircraft | Accuracy | 38.6 | 333 |
| Fine-grained visual classification | FGVC-Aircraft (test) | Top-1 Accuracy | 26.9 | 312 |
| Image Classification | Pets | Accuracy | 93.8 | 245 |