ClipCap: CLIP Prefix for Image Captioning
About
Image captioning is a fundamental task in vision-language understanding, where the model predicts an informative textual caption for a given input image. In this paper, we present a simple approach to address this task. We use the CLIP encoding as a prefix to the caption by employing a simple mapping network, and then fine-tune a language model to generate the image captions. The recently proposed CLIP model contains rich semantic features trained with textual context, making it well-suited for vision-language perception. Our key idea is that, together with a pre-trained language model (GPT-2), we obtain a broad understanding of both visual and textual data. Hence, our approach requires only rather quick training to produce a competent captioning model. Without additional annotations or pre-training, it efficiently generates meaningful captions for large-scale and diverse datasets. Surprisingly, our method works well even when only the mapping network is trained, while both CLIP and the language model remain frozen, allowing a lighter architecture with fewer trainable parameters. Through quantitative evaluation, we demonstrate that our model achieves results comparable to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets, while being simpler, faster, and lighter. Our code is available at https://github.com/rmokady/CLIP_prefix_caption.
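The core of the approach is the mapping network that turns a single CLIP image embedding into a sequence of prefix embeddings in the language model's input space. The sketch below illustrates one plausible form of such a network as a small MLP in PyTorch; the dimensions (CLIP dim 512, GPT-2 dim 768, prefix length 10) and the hidden-layer sizing are assumptions for illustration, not the exact architecture from the paper's code.

```python
import torch
import torch.nn as nn


class MappingNetwork(nn.Module):
    """Sketch of a ClipCap-style mapping network: projects one CLIP
    image embedding to `prefix_length` embeddings in the language
    model's input space. Sizes here are illustrative assumptions."""

    def __init__(self, clip_dim=512, gpt_dim=768, prefix_length=10):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt_dim = gpt_dim
        # A simple two-layer MLP; the hidden width is an assumption.
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, (gpt_dim * prefix_length) // 2),
            nn.Tanh(),
            nn.Linear((gpt_dim * prefix_length) // 2, gpt_dim * prefix_length),
        )

    def forward(self, clip_embedding):
        # clip_embedding: (batch, clip_dim)
        # returns: (batch, prefix_length, gpt_dim)
        out = self.mlp(clip_embedding)
        return out.view(-1, self.prefix_length, self.gpt_dim)


# The resulting prefix embeddings would be concatenated with the caption
# token embeddings and fed to GPT-2; when CLIP and GPT-2 stay frozen,
# only this small network needs training.
mapper = MappingNetwork()
prefix = mapper(torch.randn(4, 512))
print(prefix.shape)  # torch.Size([4, 10, 768])
```

Keeping the mapper this small is what makes the frozen-backbone variant cheap to train: the only learned parameters are the two linear layers above.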
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Captioning | MS COCO Karpathy (test) | CIDEr 113.1 | 682 |
| Visual Question Answering | A-OKVQA | Accuracy 30.9 | 175 |
| Video Captioning | MSR-VTT (test) | CIDEr 12.5 | 121 |
| Image Captioning | Flickr30k (test) | CIDEr 41.2 | 103 |
| Image Retrieval | MS-COCO (test) | -- | 98 |
| Image Captioning | nocaps (val) | CIDEr (overall) 65.8 | 93 |
| Visual Question Answering | A-OKVQA (test) | Accuracy 15.8 | 79 |
| Image Captioning | MS-COCO | CIDEr 113.1 | 61 |
| Image Captioning | nocaps (test) | CIDEr (overall) 63.4 | 61 |
| Visual Question Answering | A-OKVQA (val) | Accuracy 0.181 | 56 |