CLIP-Adapter: Better Vision-Language Models with Feature Adapters

About

Large-scale contrastive vision-language pre-training has shown significant progress in visual representation learning. Unlike traditional visual systems trained on a fixed set of discrete labels, a new paradigm was introduced by Radford et al. (2021) to directly learn to align images with raw text in an open-vocabulary setting. On downstream tasks, a carefully chosen text prompt is used to make zero-shot predictions. To avoid non-trivial prompt engineering, context optimization (Zhou et al., 2021) has been proposed to learn continuous vectors as task-specific prompts from few-shot training examples. In this paper, we show that there is an alternative path to better vision-language models other than prompt tuning. While prompt tuning operates on the textual inputs, we propose CLIP-Adapter, which fine-tunes feature adapters on either the visual or the language branch. Specifically, CLIP-Adapter adds a bottleneck layer to learn new features and performs residual-style blending of those features with the original pre-trained features. As a consequence, CLIP-Adapter is able to outperform context optimization while maintaining a simple design. Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach. Code is released at https://github.com/gaopengcuhk/CLIP-Adapter.
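
The bottleneck adapter and residual-style blending described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the released implementation: the class name `FeatureAdapter`, the `reduction` factor, the ReLU activations, the blend ratio `alpha`, and the random initialization are all assumptions made for illustration, since the abstract only states that a bottleneck layer learns new features which are then blended with the original pre-trained features.

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector stored as a list."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in x]

def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero bias; a stand-in for learned parameters."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

class FeatureAdapter:
    """Hypothetical bottleneck adapter: dim -> dim // reduction -> dim."""

    def __init__(self, dim, reduction=4, alpha=0.2, seed=0):
        rng = random.Random(seed)
        hidden = max(1, dim // reduction)
        self.down = init_linear(dim, hidden, rng)  # project into the bottleneck
        self.up = init_linear(hidden, dim, rng)    # project back to feature size
        self.alpha = alpha                         # assumed blend ratio in [0, 1]

    def adapt(self, feature):
        """Compute the new features from a frozen, pre-trained feature vector."""
        hidden = relu(linear(feature, *self.down))
        return relu(linear(hidden, *self.up))

    def blend(self, feature):
        """Residual-style blend: alpha * adapted + (1 - alpha) * original."""
        adapted = self.adapt(feature)
        a = self.alpha
        return [a * new + (1.0 - a) * old for new, old in zip(adapted, feature)]

# Example: an 8-dimensional stand-in for a pre-trained image or text feature.
feature = [0.5, -1.0, 0.25, 2.0, 0.0, 1.5, -0.75, 1.0]
adapter = FeatureAdapter(dim=len(feature), reduction=4, alpha=0.2)
blended = adapter.blend(feature)

# With alpha = 0 the adapter is bypassed and the original feature is returned.
frozen = FeatureAdapter(dim=len(feature), alpha=0.0).blend(feature)
```

In this sketch, only the two small linear layers would be trained while the pre-trained features stay fixed, and `alpha` controls how much of the adapted signal is mixed in. Setting `alpha` to 0 recovers the original features unchanged, which is what makes the blending residual in style.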

Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, Yu Qiao • 2021

Related benchmarks

Task                      Dataset            Metric      Result   Rank
Image Classification      ImageNet V2        Top-1 Acc   55.69    749
Image Classification      Stanford Cars      Accuracy    74       660
Image Classification      DTD                Accuracy    65.96    599
Image Classification      ImageNet-R         Top-1 Acc   76.6     581
Image Classification      EuroSAT            Accuracy    85.8     569
Image Classification      Flowers102         Accuracy    97.4     558
Image Classification      UCF101             Top-1 Acc   84       527
Text-to-Image Retrieval   Flickr30k (test)   Recall@1    78.74    525
Classification            Cars               Accuracy    74       492
Image Classification      DTD                Accuracy    71.7     487
Showing 10 of 216 rows