CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models

About

Pre-Trained Vision-Language Models (VL-PTMs) have shown promising capabilities in grounding natural language in image data, facilitating a broad variety of cross-modal tasks. However, we note that there exists a significant gap between the objective forms of model pre-training and fine-tuning, resulting in a need for large amounts of labeled data to stimulate the visual grounding capability of VL-PTMs for downstream tasks. To address the challenge, we present Cross-modal Prompt Tuning (CPT, alternatively, Colorful Prompt Tuning), a novel paradigm for tuning VL-PTMs, which reformulates visual grounding into a fill-in-the-blank problem with color-based co-referential markers in image and text, maximally mitigating the gap. In this way, CPT enables strong few-shot and even zero-shot visual grounding capabilities of VL-PTMs. Comprehensive experimental results show that the prompt-tuned VL-PTMs outperform their fine-tuned counterparts by a large margin (e.g., 17.3% absolute accuracy improvement, and 73.8% relative standard deviation reduction on average with one shot in RefCOCO evaluation). We make the data and code for this paper publicly available at https://github.com/thunlp/CPT.
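The fill-in-the-blank reformulation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the palette, the template wording ("... is in [MASK] color"), and the names `assign_colors`, `build_prompt`, `ground`, and `score_color_words` are all assumptions made for this example. The step that overlays each color on its image region and the masked language model itself are left out; the model is represented by a caller-supplied scoring function.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) of a candidate image region

# Hypothetical palette of color words; the real choice of colors is not given here.
PALETTE: List[str] = ["red", "green", "blue", "yellow", "purple", "orange"]


def assign_colors(boxes: Sequence[Box]) -> Dict[str, Box]:
    """Give every candidate region its own color word (the visual marker)."""
    if len(boxes) > len(PALETTE):
        raise ValueError("more regions than available colors")
    return {color: box for color, box in zip(PALETTE, boxes)}


def build_prompt(expression: str, mask_token: str = "[MASK]") -> str:
    """Turn a referring expression into a fill-in-the-blank query (assumed template)."""
    return f"{expression} is in {mask_token} color"


def ground(
    expression: str,
    boxes: Sequence[Box],
    score_color_words: Callable[[str, List[str]], Dict[str, float]],
) -> Box:
    """Return the region whose color word scores highest at the masked position.

    `score_color_words(prompt, colors)` stands in for a vision-language model
    that sees the color-marked image and scores each color word for the blank.
    """
    color_to_box = assign_colors(boxes)
    prompt = build_prompt(expression)
    scores = score_color_words(prompt, list(color_to_box))
    best_color = max(color_to_box, key=lambda color: scores[color])
    return color_to_box[best_color]
```

With a real model, `score_color_words` would return the masked-token scores restricted to the color vocabulary, so grounding reduces to the same kind of prediction the model makes during pre-training.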

Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun • 2021

Related benchmarks

Task	Dataset	Metric	Result
Referring Expression Comprehension	RefCOCO+ (val)	Accuracy	31.9	354
Referring Expression Comprehension	RefCOCO (val)	Accuracy	32.3	348
Referring Expression Comprehension	RefCOCO (testA)	Accuracy	0.361	346
Referring Expression Comprehension	RefCOCOg (test)	Accuracy	36.5	300
Referring Expression Comprehension	RefCOCOg (val)	Accuracy	36.7	300
Referring Expression Comprehension	RefCOCO+ (testB)	Accuracy	28.8	244
Referring Expression Comprehension	RefCOCO+ (testA)	Accuracy	35.2	216
Referring Expression Comprehension	RefCOCO (testB)	Accuracy	30.3	213
Referring Expression Retrieval	RefCOCO (val)	Acc@1	32.2	16
Referring Expression Comprehension	RefCOCOg UMD (val)	Accuracy	36.7	12
