Aligning Forest and Trees in Images & Long Captions for Visually Grounded Understanding

About

Vision-language models such as CLIP often struggle to faithfully understand long, detail-rich captions, relying on dominant scene cues while overlooking fine-grained visual evidence. We propose a hierarchical vision-language learning principle for understanding scenes as part-to-whole compositions: before forming a whole-scene representation, a model should uncover what semantic parts appear where in the image. To this end, we propose CAFT (Cross-domain Alignment of Forests and Trees), a vision-language model that jointly learns local text-region alignment at intermediate representations and global image-text alignment at the final representation. Exploiting the organization of long captions, where local descriptions often correspond to scene parts, CAFT employs a fine-to-coarse image encoder and a part-whole text encoder to discover localized part semantics and progressively compose them into a global image-text representation. Trained on 30M image-text pairs, CAFT achieves state-of-the-art performance on six long-text retrieval benchmarks and exhibits strong scaling behavior. Experiments show that CAFT learns fine-grained representations that localize textual semantics in image regions without explicit region-level supervision.

Byeongju Woo, Zilin Wang, Byeonghyun Pak, Sangwoo Mo, Stella X. Yu • 2026
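
The joint objective described in the abstract can be illustrated with a minimal sketch. This is not CAFT's published implementation: the function names, the max-over-regions matching rule, the temperature of 0.07, and the `local_weight` parameter are all assumptions chosen for illustration, and the encoders are replaced by precomputed embeddings. The sketch combines a global image-caption contrastive loss on final embeddings with a local loss in which each caption part is matched to its most similar image region, so no region-level labels are required.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE loss; sim[i, j] scores image i against caption j,
    and matching pairs are assumed to lie on the diagonal."""
    logits = sim / temperature
    idx = np.arange(sim.shape[0])
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (image_to_text + text_to_image)


def global_local_loss(img_global, txt_global, img_regions, txt_parts,
                      local_weight=1.0):
    """Hypothetical joint objective (illustrative assumption, not CAFT's).

    img_global:  (B, D)    final whole-image embeddings
    txt_global:  (B, D)    whole-caption embeddings
    img_regions: (B, R, D) intermediate region embeddings
    txt_parts:   (B, S, D) embeddings of local descriptions (e.g. sentences)
    """
    img_global = l2_normalize(img_global)
    txt_global = l2_normalize(txt_global)
    img_regions = l2_normalize(img_regions)
    txt_parts = l2_normalize(txt_parts)

    # Global term: cosine similarity between every image and every caption.
    global_sim = img_global @ txt_global.T  # (B, B)

    # Local term: for image i and caption j, each caption part picks its
    # best-matching region; the pair score is the mean over caption parts.
    part_region = np.einsum("ird,jsd->ijsr", img_regions, txt_parts)  # (B, B, S, R)
    local_sim = part_region.max(axis=3).mean(axis=2)  # (B, B)

    return info_nce(global_sim) + local_weight * info_nce(local_sim)
```

As a quick sanity check, calling `global_local_loss` on correctly paired embeddings should return a lower value than calling it with the captions shifted by one position in the batch, since both terms reward placing the highest similarity on the diagonal.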

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Referring Image Segmentation | RefCOCO (val) | mIoU 28.85 | 274 |
| Referring Image Segmentation | RefCOCO+ (test-B) | mIoU 28.32 | 267 |
| Referring Image Segmentation | RefCOCO (test A) | mIoU 3.08e+3 | 245 |
| Referring Image Segmentation | RefCOCO+ (val) | mIoU 28.46 | 194 |
| Referring Image Segmentation | RefCOCO (test-B) | mIoU 28.53 | 186 |
| Referring Image Segmentation | RefCOCO+ (testA) | mIoU 2.98e+3 | 112 |
| Text-to-Image Retrieval | DCI | R@1 72.4 | 106 |
| Image-to-Text Retrieval | DCI | R@1 69.7 | 100 |
| Referring Image Segmentation | G-Ref (val) | mIoU 31.39 | 95 |
| Image-to-Text Retrieval | DOCCI | R@1 80.6 | 45 |
Showing 10 of 20 rows
