ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic
About
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences. While such models can provide a powerful score for matching and subsequent zero-shot tasks, they are not capable of generating a caption for a given image. In this work, we repurpose such models to generate descriptive text for an image at inference time, without any further training or tuning steps. This is done by combining the visual-semantic model with a large language model, benefiting from the knowledge in both web-scale models. The resulting captions are much less restrictive than those obtained by supervised captioning methods. Moreover, as a zero-shot method, it is extremely flexible, and we demonstrate its ability to perform image arithmetic, in which the inputs can be either images or text and the output is a sentence. This enables novel high-level vision capabilities, such as comparing two images or solving visual analogy tests. Our code is available at: https://github.com/YoadTew/zero-shot-image-to-text.
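The "visual-semantic arithmetic" idea can be sketched in a few lines: in a joint image-text embedding space (such as CLIP's), relations are composed by adding and subtracting embeddings, and the result is decoded back into language. The toy example below uses synthetic random vectors as hypothetical stand-ins for real embeddings, and nearest-neighbor lookup in place of the paper's actual language-model decoding; it only illustrates the arithmetic, not the method itself.

```python
import numpy as np

def normalize(v):
    """Project a vector onto the unit sphere, as CLIP-style embeddings are."""
    return v / np.linalg.norm(v)

# Hypothetical unit embeddings for a few concepts (stand-ins, not real CLIP outputs).
rng = np.random.default_rng(0)
concepts = {name: normalize(rng.standard_normal(512))
            for name in ["man", "woman", "king", "queen"]}
# Construct "queen" so that the classic analogy holds in this toy space.
concepts["queen"] = normalize(concepts["king"] - concepts["man"] + concepts["woman"])

def analogy(a, b, c, vocab):
    """Return the concept closest (by cosine similarity) to emb(a) - emb(b) + emb(c)."""
    query = normalize(vocab[a] - vocab[b] + vocab[c])
    candidates = {k: v for k, v in vocab.items() if k not in (a, b, c)}
    return max(candidates, key=lambda k: float(query @ candidates[k]))

print(analogy("king", "man", "woman", concepts))  # -> queen
```

In the actual system, the operands can be image or text embeddings interchangeably, and the output is a full generated sentence rather than a retrieved vocabulary item.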
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 0.146 | 682 |
| Image Captioning | MS-COCO | CIDEr | 34.5 | 61 |
| Image Captioning | MSCOCO | BLEU@4 | 7 | 27 |
| Image Captioning | COCO (test) | CIDEr | 14.6 | 27 |
| Image Captioning | COCO (test) | CIDEr | 14.6 | 13 |
| Visual Captioning | MS-COCO English | BLEU@4 | 2.6 | 9 |
| Image Captioning | Flickr30K | BLEU-4 | 5.4 | 8 |
| Image Captioning | MS COCO random subset of 100 images (test) | BLEU-4 | 0.00e+0 | 5 |
| Visual Captioning | MSR-VTT English | BLEU@4 | 2.3 | 5 |