Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding

About

Vision-Language Models demonstrate remarkable capabilities but often struggle with compositional reasoning, exhibiting vulnerabilities regarding word order and attribute binding. This limitation arises from a scarcity of informative samples needed to differentiate subtle semantic variations during contrastive pretraining. Although hard negative mining offers a promising remedy, existing methods lack explicit mechanisms to dictate which linguistic elements undergo modification. Instead of engineering generative architectures, this study establishes lexical concreteness as a fundamental determinant of negative sample efficacy. Modifying highly concrete terms generates more pronounced structural and visual discrepancies, providing a substantially stronger learning signal. Leveraging this principle, ConcretePlant is proposed to systematically isolate and manipulate perceptually grounded concepts. Analysis of the InfoNCE loss further reveals a severe gradient imbalance, where easily distinguishable pairs dominate the optimization process and restrict the bandwidth available for nuanced learning. To resolve this degradation, the Cement loss is formulated using a margin-based approach. By correlating psycholinguistic scores with sample difficulty, this objective dynamically calibrates the penalization applied to individual training pairs. Comprehensive evaluations substantiate these theoretical claims. The integrated framework, designated as Slipform, achieves state-of-the-art accuracy across diverse compositional evaluation benchmarks, general cross-modal retrieval, and single- and multi-label linear probing.

Eun Woo Im, Dhruv Madhwal, Vivek Gupta • 2026
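
The abstract names two ideas: choosing which word to modify by its lexical concreteness, and adding a concreteness-dependent margin to a contrastive loss. The sketch below illustrates both in NumPy. It is not the authors' ConcretePlant or Cement implementation: the toy score dictionary, the function names `most_concrete_token` and `margin_info_nce`, the 1-to-5 score scale, the linear margin schedule, and the choice to add the margin to the hard-negative similarity are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical concreteness scores on a 1-5 scale (illustrative values only,
# not taken from any published norm list).
TOY_CONCRETENESS = {"dog": 4.9, "ball": 4.8, "chases": 3.2, "the": 1.4, "a": 1.5}


def most_concrete_token(caption, scores=TOY_CONCRETENESS):
    """Return (index, token) of the highest-scoring token in the caption.

    Unknown tokens get a score of 0.0, so they are never preferred.
    """
    tokens = caption.lower().split()
    idx = max(range(len(tokens)), key=lambda i: scores.get(tokens[i], 0.0))
    return idx, tokens[idx]


def margin_info_nce(image_emb, text_emb, hard_neg_emb, concreteness,
                    temperature=0.07, max_margin=0.2):
    """Image-to-text InfoNCE with one hard-negative caption per image.

    Assumption: a margin that grows linearly with the concreteness score
    (1 -> 0, 5 -> max_margin) is added to the hard-negative similarity,
    so negatives built from more concrete words are penalized more.
    """
    def l2_normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    neg = l2_normalize(hard_neg_emb)

    sim = img @ txt.T                      # (B, B) in-batch cosine similarities
    neg_sim = np.sum(img * neg, axis=1)    # (B,) similarity to own hard negative
    margin = max_margin * (np.asarray(concreteness) - 1.0) / 4.0

    # Append the margin-shifted hard negative as an extra column of logits.
    logits = np.concatenate([sim, (neg_sim + margin)[:, None]], axis=1)
    logits = logits / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    batch = np.arange(img.shape[0])
    return -log_prob[batch, batch].mean()  # positives lie on the diagonal
```

For the caption "the dog chases a ball", `most_concrete_token` selects "dog" under the toy scores, which would then be the word to swap when building a hard negative. With the margin set to zero (all scores equal to 1), `margin_info_nce` reduces to plain InfoNCE with one extra negative column.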

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Text-to-Image Retrieval | Flickr30K | R@1 | 30.35 | 531 |
| Image-to-Text Retrieval | Flickr30K | R@1 | 48.02 | 429 |
| Object Detection | COCO | mAP | 26.44 | 137 |
| Compositional Scene Understanding | Winoground | Text Alignment Score | 32.5 | 44 |
| Visual Task Adaptation | VTAB | VTAB Mean Accuracy | 42.19 | 31 |
| Vision-Language Compositional Reasoning | SugarCrepe++ | Accuracy | 66.24 | 20 |
| Text-to-Image Compositional Understanding | SugarCrepe++ T2I | Accuracy | 61.05 | 15 |
| Compositional Understanding | SugarCrepe | Accuracy | 83 | 15 |
| Image Classification | ImageNet-1K | Top-1 Accuracy | 44 | 15 |