Learning to Prompt for Vision-Language Models

About

Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Unlike traditional representation learning, which is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to a downstream task via prompting, i.e., classification weights are synthesized from natural language describing the classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming: one needs to spend a significant amount of time tuning the wording, since a slight change in wording can have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach specifically for adapting CLIP-like vision-language models to downstream image recognition. Concretely, CoOp models a prompt's context words with learnable vectors while all pre-trained parameters are kept fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts by a decent margin and gains significant improvements over prompt engineering with more shots, e.g., with 16 shots the average gain is around 15% (with the highest reaching over 45%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu • 2021
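
The mechanism described in the abstract (learnable context vectors placed in front of a class token, with the pre-trained encoders left frozen) can be illustrated with a small forward-pass sketch. This is not the authors' implementation: the "text encoder" here is a hypothetical mean-pooling stand-in for CLIP's frozen Transformer, all embeddings are random toy values, and every function and variable name is an assumption made for illustration. The sketch only shows how the unified and class-specific variants differ in the shape of what would be trained.

```python
import math
import random

DIM = 8      # toy embedding width (assumption; real CLIP embeddings are far wider)
N_CTX = 4    # M, the number of learnable context vectors per prompt


def rand_vec(rng, std=1.0):
    """Draw one DIM-dimensional vector from a zero-mean Gaussian."""
    return [rng.gauss(0.0, std) for _ in range(DIM)]


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def frozen_text_encoder(token_vectors):
    """Hypothetical stand-in for the frozen text encoder: mean-pool, then normalize."""
    pooled = [sum(col) / len(token_vectors) for col in zip(*token_vectors)]
    return l2_normalize(pooled)


def build_prompts(context, class_embeddings, class_specific):
    """Build one token sequence [ctx_1, ..., ctx_M, class] per class."""
    prompts = []
    for k, class_vec in enumerate(class_embeddings):
        # Unified: every class shares `context`. Class-specific: class k uses context[k].
        ctx = context[k] if class_specific else context
        prompts.append(list(ctx) + [class_vec])
    return prompts


def classify(image_feature, prompts, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities (image vs. each class prompt)."""
    image = l2_normalize(image_feature)
    logits = []
    for prompt in prompts:
        text = frozen_text_encoder(prompt)
        logits.append(sum(a * b for a, b in zip(image, text)) / temperature)
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
class_names = ["cat", "dog", "car"]
class_embeddings = [rand_vec(rng) for _ in class_names]  # frozen class-token embeddings
image_feature = rand_vec(rng)                            # output of the frozen image encoder

# The only quantities that would be trained in this sketch:
unified_ctx = [rand_vec(rng, std=0.02) for _ in range(N_CTX)]            # shape (M, DIM)
class_specific_ctx = [[rand_vec(rng, std=0.02) for _ in range(N_CTX)]
                      for _ in class_names]                              # shape (K, M, DIM)

probs_unified = classify(image_feature, build_prompts(unified_ctx, class_embeddings, False))
probs_specific = classify(image_feature, build_prompts(class_specific_ctx, class_embeddings, True))

n_params_unified = N_CTX * DIM
n_params_specific = len(class_names) * N_CTX * DIM
print("learnable values (unified):", n_params_unified)
print("learnable values (class-specific):", n_params_specific)
```

In an actual training loop, a cross-entropy loss on a few labeled images per class would be backpropagated through the frozen encoders to update only the context vectors; the sketch omits that step to stay dependency-free.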

Related benchmarks

Task	Dataset	Result
Image Classification	CIFAR-10 (test)	Accuracy 89.76	3381
Image Classification	ImageNet-1k (val)	--	1498
Person Re-Identification	Duke MTMC-reID (test)	Rank-1 33.7	1023
Image Classification	ImageNet 1k (test)	Top-1 Accuracy 71.9	939
Image Classification	Tiny ImageNet (test)	Accuracy 47.78	859
Image Classification	ImageNet V2	Top-1 Acc 64.2	767
Image Classification	ImageNet A	Top-1 Acc 49.71	723
Image Classification	Stanford Cars	Accuracy 83.1	705
Image Classification	ImageNet-R	Top-1 Acc 75.21	622
Image Classification	DTD	Accuracy 69.87	610

Showing 10 of 1088 rows
