
Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning

About

Fine-tuning vision-language models (VLMs) like CLIP to downstream tasks is often necessary to optimize their performance. However, a major obstacle is the limited availability of labeled data. We study the use of pseudolabels, i.e., heuristic labels for unlabeled data, to enhance CLIP via prompt tuning. Conventional pseudolabeling trains a model on labeled data and then generates labels for unlabeled data. VLMs' zero-shot capabilities enable a "second generation" of pseudolabeling approaches that do not require task-specific training on labeled data. By using zero-shot pseudolabels as a source of supervision, we observe that learning paradigms such as semi-supervised, transductive zero-shot, and unsupervised learning can all be seen as optimizing the same loss function. This unified view enables the development of versatile training strategies that are applicable across learning paradigms. We investigate them on image classification tasks where CLIP exhibits limitations, by varying prompt modalities, e.g., textual or visual prompts, and learning paradigms. We find that (1) unexplored prompt tuning strategies that iteratively refine pseudolabels consistently improve CLIP accuracy, by 19.5 points in semi-supervised learning, by 28.4 points in transductive zero-shot learning, and by 15.2 points in unsupervised learning, and (2) unlike conventional semi-supervised pseudolabeling, which exacerbates model biases toward classes with higher-quality pseudolabels, prompt tuning leads to a more equitable distribution of per-class accuracy. The code to reproduce the experiments is at https://github.com/BatsResearch/menghini-neurips23-code.

Cristina Menghini, Andrew Delworth, Stephen H. Bach • 2023
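The core recipe described in the abstract — use CLIP's zero-shot class scores to assign pseudolabels to unlabeled images, tune prompts on them, then iteratively expand the pseudolabeled set — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`assign_pseudolabels`, `iterative_schedule`) and the confidence-based top-K-per-class selection rule are assumptions made for clarity, and the actual prompt-tuning step is elided.

```python
import numpy as np

def assign_pseudolabels(logits, top_k):
    """Zero-shot pseudolabeling sketch: for each class, take the top_k
    unlabeled examples with the highest class score and pseudolabel them
    with that class. `logits` is an (n_examples, n_classes) array of
    CLIP image-text similarity scores. Returns {example_index: class_index};
    an example claimed by two classes keeps the higher-scoring assignment."""
    pseudo, best = {}, {}
    n_classes = logits.shape[1]
    for c in range(n_classes):
        ranked = np.argsort(-logits[:, c])[:top_k]  # most confident first
        for i in ranked:
            i = int(i)
            if i not in best or logits[i, c] > best[i]:
                best[i] = logits[i, c]
                pseudo[i] = c
    return pseudo

def iterative_schedule(n_unlabeled, n_classes, rounds):
    """Per-class pseudolabel budget for each refinement round: grow the
    budget linearly so more unlabeled data is pseudolabeled every round
    (after each round, prompts would be re-tuned and scores recomputed)."""
    step = n_unlabeled // (n_classes * rounds)
    return [step * (t + 1) for t in range(rounds)]
```

Keeping only the top-K most confident examples per class (rather than labeling everything at once) is what makes the iterative refinement meaningful: each round of prompt tuning sharpens the scores used to select the next, larger batch of pseudolabels.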

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | EuroSAT | Accuracy | 96.97 | 497 |
| Image Classification | DTD | Accuracy | 65.3 | 487 |
| Image Classification | Flowers102 | Accuracy | 86.26 | 478 |
| Image Classification | DTD | Accuracy | 50.9 | 419 |
| Image Classification | UCF101 | Top-1 Acc | 70.5 | 404 |
| Image Classification | MNIST | Accuracy | 74.06 | 395 |
| Image Classification | Food101 | Accuracy | 89.16 | 309 |
| Image Classification | StanfordCars | Accuracy | 60.84 | 266 |
| Image Classification | RESISC45 | Accuracy | 82.19 | 263 |
| Image Classification | FGVC-Aircraft (test) | -- | -- | 231 |

Showing 10 of 20 rows

Other info

Code: https://github.com/BatsResearch/menghini-neurips23-code
