Aligning Forest and Trees in Images & Long Captions for Visually Grounded Understanding
About
Vision-language models such as CLIP often struggle to faithfully understand long, detail-rich captions: they rely on dominant scene cues and overlook fine-grained visual evidence. We propose a hierarchical vision-language learning principle for understanding scenes as part-to-whole compositions: before forming a whole-scene representation, a model should uncover what semantic parts appear where in the image. To this end, we introduce CAFT (Cross-domain Alignment of Forests and Trees), a vision-language model that jointly learns local text-region alignment at intermediate representations and global image-text alignment at the final representation. Exploiting the organization of long captions, where local descriptions often correspond to scene parts, CAFT employs a fine-to-coarse image encoder and a part-whole text encoder to discover localized part semantics and progressively compose them into a global image-text representation. Trained on 30M image-text pairs, CAFT achieves state-of-the-art performance on six long-text retrieval benchmarks and exhibits strong scaling behavior. Experiments show that CAFT learns fine-grained representations that localize textual semantics in image regions without explicit region-level supervision.
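To make the idea of jointly optimizing a local and a global alignment objective concrete, here is a minimal NumPy sketch. It is not CAFT's actual loss, which the abstract does not spell out. It assumes a CLIP-style symmetric contrastive loss at both levels, and it assumes that region features and caption-part features are already paired by index, whereas the abstract states that CAFT discovers this correspondence without region-level supervision. The function names, the `temperature` value, and `local_weight` are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(a, b, temperature=0.07):
    # CLIP-style contrastive loss: row i of `a` is the positive for row i of `b`.
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    idx = np.arange(len(a))
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)


def joint_alignment_loss(image_global, text_global,
                         region_feats, part_text_feats,
                         local_weight=1.0):
    # image_global, text_global: (batch, dim) final representations.
    # region_feats, part_text_feats: (batch, parts, dim) intermediate
    # representations. ASSUMPTION: region k of image i is paired with
    # caption part k of caption i; this pairing is given here for illustration.
    global_loss = symmetric_info_nce(image_global, text_global)
    dim = region_feats.shape[-1]
    local_loss = symmetric_info_nce(region_feats.reshape(-1, dim),
                                    part_text_feats.reshape(-1, dim))
    return global_loss + local_weight * local_loss
```

With matched inputs both terms are small, and mispairing either the global or the local features raises the total. The single scalar weight on the local term is only one simple way to combine the two objectives.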
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Referring Image Segmentation | RefCOCO (val) | mIoU | 28.85 | 274 |
| Referring Image Segmentation | RefCOCO+ (test-B) | mIoU | 28.32 | 267 |
| Referring Image Segmentation | RefCOCO (test A) | mIoU | 3.08e+3 | 245 |
| Referring Image Segmentation | RefCOCO+ (val) | mIoU | 28.46 | 194 |
| Referring Image Segmentation | RefCOCO (test-B) | mIoU | 28.53 | 186 |
| Referring Image Segmentation | RefCOCO+ (testA) | mIoU | 2.98e+3 | 112 |
| Text-to-Image Retrieval | DCI | R@1 | 72.4 | 106 |
| Image-to-Text Retrieval | DCI | R@1 | 69.7 | 100 |
| Referring Image Segmentation | G-Ref (val) | mIoU | 31.39 | 95 |
| Image-to-Text Retrieval | DOCCI | R@1 | 80.6 | 45 |
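For readers unfamiliar with the R@1 metric reported for the retrieval tasks, the sketch below shows one common way to compute Recall@K from a query-by-candidate similarity matrix. It assumes each query has exactly one correct candidate located at the same index. The evaluation protocols of DCI and DOCCI may differ in detail, and the function name `recall_at_k` is illustrative rather than taken from any benchmark toolkit.

```python
import numpy as np


def recall_at_k(similarity, k=1):
    # similarity[i, j]: score between query i and candidate j.
    # ASSUMPTION: the correct candidate for query i is candidate i.
    similarity = np.asarray(similarity, dtype=float)
    correct = np.diag(similarity)[:, None]
    # Rank of the correct candidate = number of candidates scoring strictly higher.
    ranks = (similarity > correct).sum(axis=1)
    # Percentage of queries whose correct candidate is within the top k.
    return 100.0 * float((ranks < k).mean())


# Toy example: rows are image queries, columns are caption candidates.
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.1],
                [0.7, 0.2, 0.4]])
image_to_text_r1 = recall_at_k(sim, k=1)    # queries are images
text_to_image_r1 = recall_at_k(sim.T, k=1)  # queries are captions
```

In the toy matrix, two of the three image queries rank their own caption first, so image-to-text R@1 is about 66.7, while every caption query ranks its own image first, so text-to-image R@1 is 100.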