
FlexCap: Describe Anything in Images in Controllable Detail

About

We introduce FlexCap, a vision-language model that generates region-specific descriptions of varying lengths. FlexCap is trained to produce length-conditioned captions for input boxes, enabling control over information density, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions of varying lengths from captioned web images. We demonstrate FlexCap's effectiveness in several applications. First, it achieves strong performance in dense captioning tasks on the Visual Genome dataset. Second, we show how FlexCap's localized descriptions can serve as input to a large language model to create a visual question answering (VQA) system, achieving state-of-the-art zero-shot performance on multiple VQA benchmarks. Our experiments further illustrate FlexCap's utility for tasks including image labeling, object attribute recognition, and visual dialog. Project webpage: https://flex-cap.github.io.

Debidatta Dwibedi, Vidhi Jain, Jonathan Tompson, Andrew Zisserman, Yusuf Aytar • 2024
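
To make the localize-then-describe-then-answer idea from the abstract concrete, the sketch below shows how localized region captions might be composed into an LLM prompt for zero-shot VQA. It is a minimal illustration only: the names Box, flexcap_describe, localize_then_ask, the target_length parameter, and the prompt format are assumptions made for this example, not FlexCap's released interface.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Box:
    """Axis-aligned region of interest in pixel coordinates (hypothetical helper)."""
    x0: float
    y0: float
    x1: float
    y1: float


def flexcap_describe(image, box: Box, target_length: int) -> str:
    """Hypothetical stand-in for FlexCap inference: return a caption of
    roughly `target_length` words for the image region inside `box`.
    The real model would run length-conditioned decoding; a placeholder
    string keeps this sketch self-contained and runnable."""
    return f"object in region ({box.x0:.0f}, {box.y0:.0f}, {box.x1:.0f}, {box.y1:.0f})"


def localize_then_ask(image, boxes: List[Box], question: str,
                      llm: Callable[[str], str]) -> str:
    """Collect localized captions and hand them to an LLM as textual
    context for zero-shot visual question answering."""
    captions = [flexcap_describe(image, b, target_length=8) for b in boxes]
    context = "\n".join(f"- {c}" for c in captions)
    prompt = (
        "The image contains the following regions:\n"
        f"{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return llm(prompt)


if __name__ == "__main__":
    boxes = [Box(10, 20, 120, 200), Box(150, 40, 300, 220)]
    # A trivial stub LLM; any text-completion model could be plugged in here.
    answer = localize_then_ask(None, boxes, "What is in the image?",
                               llm=lambda prompt: "(answer from the LLM)")
    print(answer)
```

The design point the sketch illustrates is that the vision model's only job is to turn boxes into text of a chosen length; all question answering is delegated to the language model through the prompt.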

Related benchmarks

Task                        Dataset                         Metric            Result  Rank
Visual Question Answering   VQA v2 (test-dev)               Overall Accuracy  65.6    664
Video Question Answering    MSRVTT-QA (test)                Accuracy          25      371
Video Question Answering    MSVD-QA (test)                  Accuracy          39.5    274
Visual Question Answering   OKVQA (val)                     VQA Score         52.1    101
Visual Question Answering   VizWiz (test-dev)               Accuracy          41.8    65
Visual Question Answering   GQA balanced (test-dev)         Accuracy          48.8    32
Dense Captioning            Visual Genome                   mAP               16.2    16
Dense Captioning            Visual Genome (test)            mAP               16.2    13
Region Classification       MS-COCO                         mAP               85      10
Dense Captioning            Visual Genome Karpathy (test)   mAP               16.2    7
