Long Story Short: Disentangling Compositionality and Long-Caption Understanding in Contrastive VLMs

About

Contrastive vision-language models (VLMs) have made significant progress in binding visual and textual information, yet understanding long, compositional captions remains an open challenge. While these capabilities are often assumed to be closely related, the conditions under which they reinforce each other remain unclear. In this paper, we empirically analyze when compositional reasoning and long-caption understanding transfer across tasks, and when this relationship fails. Through controlled experiments across diverse training objectives, datasets, and architectural designs, we find a bidirectional but sensitive relationship between the two capabilities. Models trained on poorly grounded captions or with limited parameter updates fail to generalize, while high-quality long-caption data with strong visual grounding promotes both capabilities simultaneously. We further show that architectural choices aimed at preserving general alignment, such as frozen positional embeddings, can inadvertently limit compositional learning. Our analysis provides actionable guidelines for data selection and model design to improve VLM generalization.

Israfel Salazar, Desmond Elliott, Yova Kementchedjhieva • 2025

Related benchmarks

Task	Dataset	Result
Text-to-Image Retrieval	Flickr30k (test)	Recall@1 68.4	528
Image-to-Text Retrieval	Flickr30k (test)	R@1 83	472
Image Classification	ImageNet	Top-1 Accuracy 65.1	384
Image Classification	CIFAR100	Accuracy 68	301
Image Classification	CIFAR10	Accuracy (%) 91.4	282
Image-to-Text Retrieval	DOCCI	--	66
Text-to-Image Retrieval	DOCCI	--	66
Compositional Vision-Language Reasoning	Winoground	Text Score 30.8	61
Image-to-Text Retrieval	Urban1k	--	36
Image Classification	ImageNet-1K	Accuracy 60.8	33

Showing 10 of 26 rows
