ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic
About
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences. While such models can provide a powerful score for matching and subsequent zero-shot tasks, they are not capable of generating a caption for a given image. In this work, we repurpose such models to generate descriptive text for an image at inference time, without any further training or tuning steps. This is done by combining the visual-semantic model with a large language model, benefiting from the knowledge in both web-scale models. The resulting captions are much less restrictive than those obtained by supervised captioning methods. Moreover, as a zero-shot method, it is extremely flexible, and we demonstrate its ability to perform image arithmetic, in which the inputs can be either images or text and the output is a sentence. This enables novel high-level vision capabilities, such as comparing two images or solving visual analogy tests. Our code is available at: https://github.com/YoadTew/zero-shot-image-to-text.
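The "visual-semantic arithmetic" idea can be sketched in a few lines: in a joint image-text embedding space (such as CLIP's), relations are composed by adding and subtracting embeddings, and the result is decoded back into language. The toy example below uses synthetic random vectors as hypothetical stand-ins for real embeddings, and nearest-neighbor lookup in place of the paper's actual language-model decoding; it only illustrates the arithmetic, not the method itself.

```python
import numpy as np

def normalize(v):
    """Project a vector onto the unit sphere, as CLIP-style embeddings are."""
    return v / np.linalg.norm(v)

# Hypothetical unit embeddings for a few concepts (stand-ins, not real CLIP outputs).
rng = np.random.default_rng(0)
concepts = {name: normalize(rng.standard_normal(512))
            for name in ["man", "woman", "king", "queen"]}
# Construct "queen" so that the classic analogy holds in this toy space.
concepts["queen"] = normalize(concepts["king"] - concepts["man"] + concepts["woman"])

def analogy(a, b, c, vocab):
    """Return the concept closest (by cosine similarity) to emb(a) - emb(b) + emb(c)."""
    query = normalize(vocab[a] - vocab[b] + vocab[c])
    candidates = {k: v for k, v in vocab.items() if k not in (a, b, c)}
    return max(candidates, key=lambda k: float(query @ candidates[k]))

print(analogy("king", "man", "woman", concepts))  # -> queen
```

In the actual system, the operands can be image or text embeddings interchangeably, and the output is a full generated sentence rather than a retrieved vocabulary item.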
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 0.146 | 682 |
| Image Captioning | MS-COCO | CIDEr | 34.5 | 61 |
| Image Captioning | MSCOCO | BLEU@4 | 7 | 27 |
| Image Captioning | COCO (test) | CIDEr | 14.6 | 27 |
| Image Captioning | COCO (test) | CIDEr | 14.6 | 13 |
| Visual Captioning | MS-COCO English | BLEU@4 | 2.6 | 9 |
| Image Captioning | Flickr30K | BLEU-4 | 5.4 | 8 |
| Image Captioning | MS COCO random subset of 100 images (test) | BLEU-4 | 0.00e+0 | 5 |
| Visual Captioning | MSR-VTT English | BLEU@4 | 2.3 | 5 |