ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic

About

Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences. While such models can provide a powerful score for matching and subsequent zero-shot tasks, they are not capable of generating a caption given an image. In this work, we repurpose such models to generate a descriptive text given an image at inference time, without any further training or tuning steps. This is done by combining the visual-semantic model with a large language model, benefiting from the knowledge in both web-scale models. The resulting captions are much less restrictive than those obtained by supervised captioning methods. Moreover, as a zero-shot learning method, it is extremely flexible, and we demonstrate its ability to perform image arithmetic in which the inputs can be either images or text, and the output is a sentence. This enables novel high-level vision capabilities such as comparing two images or solving visual analogy tests. Our code is available at: https://github.com/YoadTew/zero-shot-image-to-text.

Yoad Tewel, Yoav Shalev, Idan Schwartz, Lior Wolf • 2021
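
The image arithmetic mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the vectors below are hand-made toy stand-ins for image and text embeddings in a shared space, and instead of guiding a language model to generate a sentence, the sketch simply ranks a fixed list of candidate labels by cosine similarity. All names (`arithmetic_query`, `rank_candidates`, the toy embeddings) are hypothetical and chosen for illustration only.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def arithmetic_query(plus, minus):
    """Add the normalized 'plus' embeddings, subtract the 'minus' ones, renormalize."""
    total = sum(normalize(p) for p in plus) - sum(normalize(m) for m in minus)
    return normalize(total)

def rank_candidates(query, candidates):
    """Return candidate names sorted by cosine similarity to the query, best first."""
    scores = {name: float(normalize(vec) @ query) for name, vec in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy 3-d embeddings (assumption): axes are [royalty, male, female].
# In a real system these would come from a visual-semantic model's image or text encoder.
embeddings = {
    "king": [1.0, 1.0, 0.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "queen": [1.0, 0.0, 1.0],
}

# "king" - "man" + "woman": inputs could be images or text in the same space.
query = arithmetic_query(
    plus=[embeddings["king"], embeddings["woman"]],
    minus=[embeddings["man"]],
)
ranking = rank_candidates(query, embeddings)
print(ranking)
```

With these toy vectors the top-ranked candidate is "queen". The actual method described in the abstract produces a free-form sentence rather than choosing from a fixed list.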

Related benchmarks

Task               Dataset                                     Result           Rank
Image Captioning   MS COCO Karpathy (test)                     CIDEr 0.146      682
Image Captioning   MS-COCO                                     CIDEr 34.5       61
Image Captioning   MSCOCO                                      BLEU@4 7         27
Image Captioning   COCO (test)                                 CIDEr 14.6       27
Image Captioning   COCO (test)                                 CIDEr 14.6       13
Visual Captioning  MS-COCO English                             BLEU@4 2.6       9
Image Captioning   Flickr30K                                   BLEU-4 5.4       8
Image Captioning   MS COCO random subset of 100 images (test)  BLEU-4 0.00e+0   5
Visual Captioning  MSR-VTT English                             BLEU@4 2.3       5
