CLIP Is Shortsighted: Paying Attention Beyond the First Sentence

About

CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models. However, CLIP's pretraining is dominated by images paired with short captions, biasing the model toward encoding simple descriptions of salient objects and leading to coarse alignment on complex scenes and dense descriptions. While recent work mitigates this by fine-tuning on small-scale long-caption datasets, we identify an important common bias: both human- and LLM-generated long captions typically begin with a one-sentence summary followed by a detailed description. We show that this acts as a shortcut during training, concentrating attention on the opening sentence and early tokens and weakening alignment over the rest of the caption. To resolve this, we introduce DeBias-CLIP, which removes the summary sentence during training and applies sentence sub-sampling and text token padding to distribute supervision across all token positions. DeBias-CLIP achieves state-of-the-art long-text retrieval, improves short-text retrieval, and is less sensitive to sentence order permutations. It is a drop-in replacement for Long-CLIP with no additional trainable parameters.

Marc-Antoine Lavoie, Anas Mahmoud, Aldo Zaimi, Arsene Fansi Tchango, Steven L. Waslander • 2026
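
The three training-time text operations named in the abstract (removing the summary sentence, sentence sub-sampling, and token padding) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the naive sentence splitter, the `keep_prob` value, and the choice to place padding before the text at a random offset are all assumptions made for illustration, since the abstract does not specify these details.

```python
import random
import re


def split_sentences(caption):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p.strip() for p in parts if p.strip()]


def debias_caption(caption, keep_prob=0.5, rng=None):
    """Drop the opening summary sentence, then randomly sub-sample the rest.

    keep_prob is a hypothetical hyperparameter, not a value from the paper.
    """
    rng = rng or random.Random()
    sentences = split_sentences(caption)
    # Remove the first (summary) sentence when other sentences remain.
    body = sentences[1:] if len(sentences) > 1 else sentences
    # Keep each remaining sentence independently, preserving order.
    kept = [s for s in body if rng.random() < keep_prob]
    if not kept:
        # Never return an empty caption: fall back to one random sentence.
        kept = [rng.choice(body)]
    return " ".join(kept)


def pad_tokens(token_ids, context_len, pad_id=0, rng=None):
    """Shift the tokens to a random offset inside the context window.

    Placing padding before the text is an assumed reading of "text token
    padding"; the intent sketched here is that later positions also
    receive real tokens during training.
    """
    rng = rng or random.Random()
    token_ids = token_ids[:context_len]
    free = context_len - len(token_ids)
    offset = rng.randint(0, free)
    return [pad_id] * offset + token_ids + [pad_id] * (free - offset)
```

For example, `debias_caption("A dog in a park. The dog is brown. It holds a red ball.")` returns a caption built only from the second and third sentences, and `pad_tokens([5, 6, 7], 8)` returns a length-8 list with the three tokens at a random offset. A real pipeline would use the CLIP tokenizer and its own padding token rather than the integer placeholders used here.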

Related benchmarks

Task                      Dataset     Result      Rank
Text-to-Image Retrieval   Flickr30K   R@1 43.9    531
Image-to-Text Retrieval   Flickr30K   R@1 64.8    429
Text-to-Image Retrieval   COCO        R@1 48.1    156
Image-to-Text Retrieval   COCO        R@1 65.1    149
Text-to-Image Retrieval   DCI         R@1 73.5    79
Image-to-Text Retrieval   DCI         R@1 72.8    79
Image-to-Text Retrieval   DOCCI       R@1 85.2    38
Text-to-Image Retrieval   DOCCI       R@1 85.6    38
Image-to-Text Retrieval   Urban1k     R@1 95.2    36
Text-to-Image Retrieval   Urban1k     R@1 95.2    28
Showing 10 of 12 rows
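
For readers unfamiliar with the R@1 metric reported above, the following is a minimal sketch of how Recall@1 is typically computed from a query-by-candidate similarity matrix. It assumes a one-to-one pairing in which query i matches candidate i; datasets with several captions per image, such as Flickr30K and COCO, need a set of correct candidates per query instead. The function name is hypothetical.

```python
def recall_at_1(similarity):
    """Percentage of queries whose top-scoring candidate is the paired one.

    similarity[i][j] is the score between query i and candidate j.
    Assumes the correct candidate for query i has index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Index of the highest-scoring candidate for this query.
        best = max(range(len(row)), key=row.__getitem__)
        if best == i:
            hits += 1
    return 100.0 * hits / len(similarity)
```

With three queries where only the first two rank their own pair highest, the function returns two thirds expressed as a percentage.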
