CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models
About
Pre-Trained Vision-Language Models (VL-PTMs) have shown promising capabilities in grounding natural language in image data, facilitating a broad variety of cross-modal tasks. However, we note that there exists a significant gap between the objective forms of model pre-training and fine-tuning, resulting in a need for large amounts of labeled data to stimulate the visual grounding capability of VL-PTMs for downstream tasks. To address the challenge, we present Cross-modal Prompt Tuning (CPT, alternatively, Colorful Prompt Tuning), a novel paradigm for tuning VL-PTMs, which reformulates visual grounding into a fill-in-the-blank problem with color-based co-referential markers in image and text, maximally mitigating the gap. In this way, CPT enables strong few-shot and even zero-shot visual grounding capabilities of VL-PTMs. Comprehensive experimental results show that the prompt-tuned VL-PTMs outperform their fine-tuned counterparts by a large margin (e.g., 17.3% absolute accuracy improvement, and 73.8% relative standard deviation reduction on average with one shot in RefCOCO evaluation). We make the data and code for this paper publicly available at https://github.com/thunlp/CPT.
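The fill-in-the-blank reformulation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the color set, the template wording ("... is in [MASK] color"), and the names `mark_regions`, `build_prompt`, and `ground` are all assumptions made here, and the call to the VL-PTM is replaced by a hand-written score dictionary.

```python
from dataclasses import dataclass

# Hypothetical color markers: (color word used in text, RGB used to tint the region).
COLORS = [("red", (240, 0, 30)), ("green", (0, 180, 60)), ("blue", (0, 60, 240))]


@dataclass
class Region:
    box: tuple        # (x1, y1, x2, y2) of a region proposal
    color_word: str   # word the model may predict at the mask position
    rgb: tuple        # color that would be overlaid on the region in the image


def mark_regions(boxes):
    """Pair each region proposal with a distinct color marker."""
    if len(boxes) > len(COLORS):
        raise ValueError("more proposals than colors; split them into batches")
    return [Region(box, word, rgb) for box, (word, rgb) in zip(boxes, COLORS)]


def build_prompt(query, mask_token="[MASK]"):
    """Turn a referring expression into a fill-in-the-blank sentence (assumed wording)."""
    return f"{query} is in {mask_token} color"


def ground(regions, mask_scores):
    """Return the region whose color word scores highest at the mask position."""
    return max(regions, key=lambda r: mask_scores.get(r.color_word, float("-inf")))


boxes = [(10, 20, 110, 220), (150, 40, 300, 260)]
regions = mark_regions(boxes)
prompt = build_prompt("the horse watched by the woman")

# Stand-in for the model's scores over color words at the mask position.
mask_scores = {"red": 0.2, "green": 0.7, "blue": 0.1}
target = ground(regions, mask_scores)
```

Because the text-side color word and the image-side color overlay refer to the same region, choosing a word at the mask position amounts to choosing a region, which is how the task takes the same form as masked-token prediction during pre-training.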
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 31.9 | 345 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy | 32.3 | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy | 0.361 | 333 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy | 36.5 | 291 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy | 36.7 | 291 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy | 28.8 | 235 |
| Referring Expression Comprehension | RefCOCO+ (testA) | Accuracy | 35.2 | 207 |
| Referring Expression Comprehension | RefCOCO (testB) | Accuracy | 30.3 | 196 |
| Referring Expression Retrieval | RefCOCO (val) | Acc@1 | 32.2 | 16 |
| Referring Expression Comprehension | RefCOCOg UMD (val) | Accuracy | 36.7 | 12 |