CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding
About
Visual Grounding (VG) is a crucial topic in the field of vision and language, which involves locating a specific region described by expressions within an image. To reduce the reliance on manually labeled data, unsupervised visual grounding methods have been developed to locate regions using pseudo-labels. However, the performance of existing unsupervised methods is highly dependent on the quality of pseudo-labels, and these methods always encounter issues with limited diversity. In order to utilize vision and language pre-trained models to address the grounding problem, and to take reasonable advantage of pseudo-labels, we propose CLIP-VG, a novel method that can conduct self-paced curriculum adapting of CLIP with pseudo-language labels. We propose a simple yet efficient end-to-end network architecture to realize the transfer of CLIP to visual grounding. Based on the CLIP-based architecture, we further propose single-source and multi-source curriculum adapting algorithms, which can progressively find more reliable pseudo-labels to learn an optimal model, thereby achieving a balance between reliability and diversity for the pseudo-language labels. Our method outperforms the current state-of-the-art unsupervised method by a significant margin on the RefCOCO/+/g datasets in both single-source and multi-source scenarios, with improvements ranging from 6.78% to 10.67% and 11.39% to 14.87%, respectively. The results even outperform existing weakly supervised visual grounding methods. Furthermore, our method is also competitive in the fully supervised setting. The code and models are available at https://github.com/linhuixiao/CLIP-VG.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Referring Expression Comprehension | RefCOCO v1 (val) | Top-1 Accuracy | 84.29 | 49 |
| Referring Expression Comprehension | ReferItGame (test) | Top-1 Acc | 70.89 | 47 |
| Visual Grounding | RefFLIR 1.0 (val) | Accuracy @ 0.5 IoU | 43.68 | 29 |
| Referring Expression Comprehension | Flickr30K Entities (test) | Top-1 Accuracy | 81.99 | 17 |
| Visual Grounding | RefFLIR RGBT-Ground (val) | Acc@0.5 | 0.4557 | 10 |
| Visual Grounding | RefFLIR RGBT-Ground (test) | Accuracy @ 0.5 IoU | 46.01 | 10 |
| Visual Grounding | RefM3FD RGBT-Ground (val) | Acc@0.5 | 34.52 | 10 |
| Visual Grounding | RefM3FD RGBT-Ground (test) | Accuracy @ 0.5 | 38.92 | 10 |
| Visual Grounding | RefMFAD RGBT-Ground (val) | Acc@0.5 | 0.4702 | 10 |
| Visual Grounding | RefMFAD RGBT-Ground (test) | Accuracy @ 0.5 IoU | 47.52 | 10 |
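
Several rows above report accuracy at an IoU threshold of 0.5. As a rough illustration of how such a score is commonly computed, the sketch below counts a prediction as correct when its intersection-over-union with the ground-truth box reaches the threshold. This is not taken from the CLIP-VG code. The function names `box_iou` and `accuracy_at_iou`, the `(x1, y1, x2, y2)` corner box format, and the use of `>=` rather than `>` at the threshold are all assumptions made for the example.

```python
from typing import Sequence

# Assumed box format: (x1, y1, x2, y2) corner coordinates.
Box = Sequence[float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds: Sequence[Box], targets: Sequence[Box],
                    threshold: float = 0.5) -> float:
    """Percentage of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    # Assumption: a box exactly at the threshold counts as a hit.
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)
```

With one prediction per expression, `accuracy_at_iou(preds, targets)` returns a percentage. Some rows in the table appear to report the same quantity as a fraction instead (for example 0.4557), which would correspond to dropping the factor of 100.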