Long Story Short: Disentangling Compositionality and Long-Caption Understanding in Contrastive VLMs
About
Contrastive vision-language models (VLMs) have made significant progress in binding visual and textual information, yet understanding long, compositional captions remains an open challenge. While these capabilities are often assumed to be closely related, the conditions under which they reinforce each other remain unclear. In this paper, we empirically analyze when compositional reasoning and long-caption understanding transfer across tasks, and when this relationship fails. Through controlled experiments across diverse training objectives, datasets, and architectural designs, we find a bidirectional but sensitive relationship between the two capabilities. Models trained on poorly grounded captions or with limited parameter updates fail to generalize, while high-quality long-caption data with strong visual grounding promotes both capabilities simultaneously. We further show that architectural choices aimed at preserving general alignment, such as frozen positional embeddings, can inadvertently limit compositional learning. Our analysis provides actionable guidelines for data selection and model design to improve VLM generalization.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Retrieval | Flickr30k (test) | Recall@1 | 68.4 | 525 |
| Image-to-Text Retrieval | Flickr30k (test) | R@1 | 83 | 472 |
| Image Classification | ImageNet | Top-1 Accuracy | 65.1 | 343 |
| Image Classification | CIFAR100 | Accuracy | 68 | 301 |
| Image Classification | CIFAR10 | Accuracy (%) | 91.4 | 282 |
| Compositional Vision-Language Reasoning | Winoground | Text Score | 30.8 | 61 |
| Image-to-Text Retrieval | DOCCI | -- | -- | 45 |
| Text-to-Image Retrieval | DOCCI | -- | -- | 45 |
| Image-to-Text Retrieval | Urban1k | -- | -- | 36 |
| Image Classification | ImageNet-1K | Accuracy | 60.8 | 33 |