FlexCap: Describe Anything in Images in Controllable Detail
About
We introduce FlexCap, a vision-language model that generates region-specific descriptions of varying lengths. FlexCap is trained to produce length-conditioned captions for input boxes, enabling control over information density, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions of varying lengths from captioned web images. We demonstrate FlexCap's effectiveness in several applications. First, it achieves strong performance in dense captioning on the Visual Genome dataset. Second, FlexCap's localized descriptions can serve as input to a large language model to create a visual question answering (VQA) system, achieving state-of-the-art zero-shot performance on multiple VQA benchmarks. Finally, our experiments illustrate FlexCap's utility for tasks including image labeling, object attribute recognition, and visual dialog. Project webpage: https://flex-cap.github.io.
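The core idea, length-conditioned captioning for a box, can be sketched as a conditioning prefix that the model would complete. This is a minimal illustration with hypothetical names (`build_caption_prefix`, the `<box>`/`<len>` tokens); the actual FlexCap tokenization and interface are not specified on this page.

```python
def build_caption_prefix(box, num_words):
    """Build a hypothetical conditioning prefix for a region caption.

    box: (x1, y1, x2, y2) coordinates of the input box, normalized to [0, 1].
    num_words: desired caption length, which controls information density --
    a small value asks for a terse label, a large one for a detailed caption.
    """
    x1, y1, x2, y2 = box
    return f"<box>{x1:.2f},{y1:.2f},{x2:.2f},{y2:.2f}</box> <len>{num_words}</len>"

# Same region, two levels of detail: the model would complete the short
# prefix with e.g. "a dog" and the long one with a full sentence.
short_prefix = build_caption_prefix((0.10, 0.20, 0.50, 0.60), 2)
long_prefix = build_caption_prefix((0.10, 0.20, 0.50, 0.60), 12)
print(short_prefix)
print(long_prefix)
```

The same prefix-then-complete pattern is how the VQA application described above could be wired up: captions generated for many boxes become the textual context handed to a large language model.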
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 65.6 | 664 |
| Video Question Answering | MSRVTT-QA (test) | Accuracy | 25 | 371 |
| Video Question Answering | MSVD-QA (test) | Accuracy | 39.5 | 274 |
| Visual Question Answering | OKVQA (val) | VQA Score | 52.1 | 101 |
| Visual Question Answering | VizWiz (test-dev) | Accuracy | 41.8 | 65 |
| Visual Question Answering | GQA balanced (test-dev) | Accuracy | 48.8 | 32 |
| Dense Captioning | Visual Genome | mAP | 16.2 | 16 |
| Dense Captioning | Visual Genome (test) | mAP | 16.2 | 13 |
| Region Classification | MS-COCO | mAP | 85 | 10 |
| Dense Captioning | Visual Genome Karpathy (test) | mAP | 16.2 | 7 |