FlexCap: Describe Anything in Images in Controllable Detail
About
We introduce FlexCap, a vision-language model that generates region-specific descriptions of varying lengths. FlexCap is trained to produce length-conditioned captions for input boxes, enabling control over information density, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions of varying lengths from captioned web images. We demonstrate FlexCap's effectiveness in several applications. First, it achieves strong performance in dense captioning on the Visual Genome dataset. Second, FlexCap's localized descriptions can serve as input to a large language model to create a visual question answering (VQA) system, achieving state-of-the-art zero-shot performance on multiple VQA benchmarks. Finally, our experiments illustrate FlexCap's utility for tasks including image labeling, object attribute recognition, and visual dialog. Project webpage: https://flex-cap.github.io.
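The core idea, length-conditioned captioning for a box, can be sketched as a conditioning prefix that the model would complete. This is a minimal illustration with hypothetical names (`build_caption_prefix`, the `<box>`/`<len>` tokens); the actual FlexCap tokenization and interface are not specified on this page.

```python
def build_caption_prefix(box, num_words):
    """Build a hypothetical conditioning prefix for a region caption.

    box: (x1, y1, x2, y2) coordinates of the input box, normalized to [0, 1].
    num_words: desired caption length, which controls information density --
    a small value asks for a terse label, a large one for a detailed caption.
    """
    x1, y1, x2, y2 = box
    return f"<box>{x1:.2f},{y1:.2f},{x2:.2f},{y2:.2f}</box> <len>{num_words}</len>"

# Same region, two levels of detail: the model would complete the short
# prefix with e.g. "a dog" and the long one with a full sentence.
short_prefix = build_caption_prefix((0.10, 0.20, 0.50, 0.60), 2)
long_prefix = build_caption_prefix((0.10, 0.20, 0.50, 0.60), 12)
print(short_prefix)
print(long_prefix)
```

The same prefix-then-complete pattern is how the VQA application described above could be wired up: captions generated for many boxes become the textual context handed to a large language model.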
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 65.6 | 664 |
| Video Question Answering | MSRVTT-QA (test) | Accuracy | 25 | 371 |
| Video Question Answering | MSVD-QA (test) | Accuracy | 39.5 | 274 |
| Visual Question Answering | OKVQA (val) | VQA Score | 52.1 | 101 |
| Visual Question Answering | VizWiz (test-dev) | Accuracy | 41.8 | 65 |
| Visual Question Answering | GQA balanced (test-dev) | Accuracy | 48.8 | 32 |
| Dense Captioning | Visual Genome | mAP | 16.2 | 16 |
| Dense Captioning | Visual Genome (test) | mAP | 16.2 | 13 |
| Region Classification | MS-COCO | mAP | 85 | 10 |
| Dense Captioning | Visual Genome Karpathy (test) | mAP | 16.2 | 7 |