CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models

About

Pre-Trained Vision-Language Models (VL-PTMs) have shown promising capabilities in grounding natural language in image data, facilitating a broad variety of cross-modal tasks. However, we note that there exists a significant gap between the objective forms of model pre-training and fine-tuning, resulting in a need for large amounts of labeled data to stimulate the visual grounding capability of VL-PTMs for downstream tasks. To address the challenge, we present Cross-modal Prompt Tuning (CPT, alternatively, Colorful Prompt Tuning), a novel paradigm for tuning VL-PTMs, which reformulates visual grounding into a fill-in-the-blank problem with color-based co-referential markers in image and text, maximally mitigating the gap. In this way, CPT enables strong few-shot and even zero-shot visual grounding capabilities of VL-PTMs. Comprehensive experimental results show that the prompt-tuned VL-PTMs outperform their fine-tuned counterparts by a large margin (e.g., 17.3% absolute accuracy improvement, and 73.8% relative standard deviation reduction on average with one shot in RefCOCO evaluation). We make the data and code for this paper publicly available at https://github.com/thunlp/CPT.

Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun • 2021
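
The fill-in-the-blank reformulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). The color list, the template wording, the function names, and the `toy_scorer` stand-in for a VL-PTM's masked-token scores are all assumptions made for illustration. The idea: each candidate image region is tagged with a distinct color, the query is turned into a template containing a masked color word, and the region whose color the model scores highest is taken as the grounding result.

```python
from typing import Callable, Dict, List

# Hypothetical color vocabulary; the color set used in the paper may differ.
COLORS = ["red", "green", "blue", "yellow", "purple"]


def build_prompt(query: str, mask_token: str = "[MASK]") -> str:
    """Turn a referring expression into a fill-in-the-blank template (assumed wording)."""
    return f"{query} is in {mask_token} color"


def assign_colors(region_ids: List[str]) -> Dict[str, str]:
    """Map one distinct color marker to each candidate region (color -> region)."""
    if len(region_ids) > len(COLORS):
        raise ValueError("more candidate regions than available colors")
    return dict(zip(COLORS, region_ids))


def ground(
    query: str,
    region_ids: List[str],
    score_colors: Callable[[str, List[str]], Dict[str, float]],
) -> str:
    """Return the region whose color word gets the highest score at the mask."""
    color_to_region = assign_colors(region_ids)
    prompt = build_prompt(query)
    # score_colors stands in for a VL-PTM that sees the color-marked image
    # plus the prompt and scores each candidate color word at the mask.
    scores = score_colors(prompt, list(color_to_region))
    best_color = max(scores, key=scores.get)
    return color_to_region[best_color]


def toy_scorer(prompt: str, colors: List[str]) -> Dict[str, float]:
    """Toy stand-in for the model: always prefers 'green'."""
    return {c: 1.0 if c == "green" else 0.0 for c in colors}


result = ground("the horse watched by the woman", ["box_0", "box_1", "box_2"], toy_scorer)
print(result)  # box_1, because 'green' was assigned to the second region
```

In a real setup, `toy_scorer` would be replaced by a call that overlays the colors on the image regions and reads the model's masked-token scores for the color words. Because this mirrors the masked-language-modeling form of pre-training, no new task-specific head is required, which is the gap the abstract says CPT mitigates.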

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy 31.9 | 345 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy 32.3 | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy 0.361 | 333 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy 36.5 | 291 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy 36.7 | 291 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy 28.8 | 235 |
| Referring Expression Comprehension | RefCOCO+ (testA) | Accuracy 35.2 | 207 |
| Referring Expression Comprehension | RefCOCO (testB) | Accuracy 30.3 | 196 |
| Referring Expression Retrieval | RefCOCO (val) | Acc@1 32.2 | 16 |
| Referring Expression Comprehension | RefCOCOg UMD (val) | Accuracy 36.7 | 12 |