
FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Alignment

About

False negatives pose a critical challenge in vision-language pretraining (VLP) due to the many-to-many correspondence between images and texts in large-scale datasets. These false negatives introduce conflicting supervision signals that degrade the learned embedding space and diminish the effectiveness of hard negative sampling. In this paper, we propose FALCON (False-negative Aware Learning of COntrastive Negatives), a learning-based mini-batch construction strategy that adaptively balances the trade-off between hard and false negatives during VLP. Rather than relying on fixed heuristics, FALCON employs a negative mining scheduler that dynamically selects negative samples of appropriate hardness for each anchor instance during mini-batch construction, guided by a proxy for cross-modal alignment improvement. Experimental results demonstrate that FALCON significantly improves performance across three vision-language learning frameworks (ALBEF, BLIP-2, SigLIP-2) and a broad range of downstream tasks and evaluation settings, underscoring its effectiveness and robustness in mitigating the impact of false negatives.
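The core idea described above, selecting negatives of a scheduled hardness per anchor while filtering out likely false negatives, can be sketched as follows. This is a minimal illustration, not FALCON's actual method: the function name, the fixed cosine-similarity threshold for flagging false negatives, and the scalar `target_hardness` knob (which in the paper is set adaptively by the negative mining scheduler) are all assumptions made for the example.

```python
import numpy as np

def select_negatives(anchor, candidates, target_hardness, fn_threshold=0.9, k=4):
    """Pick k negatives whose cosine similarity to the anchor is closest to
    target_hardness, skipping candidates above fn_threshold (treated as
    likely false negatives). Illustrative only; names and thresholds are
    not taken from the paper."""
    # Cosine similarity between the anchor and every candidate embedding.
    a = anchor / np.linalg.norm(anchor)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = c @ a
    # Exclude candidates that are too similar to the anchor: in a
    # many-to-many image-text corpus these are probably true matches.
    valid = np.where(sims < fn_threshold)[0]
    # Among the rest, take those whose hardness is nearest the scheduled target.
    order = valid[np.argsort(np.abs(sims[valid] - target_hardness))]
    return order[:k]

rng = np.random.default_rng(0)
anchor = rng.normal(size=64)
candidates = rng.normal(size=(100, 64))
idx = select_negatives(anchor, candidates, target_hardness=0.3)
print(idx)  # indices of the 4 selected negatives
```

In the full method, a per-anchor selection like this would be run during mini-batch construction, with the hardness target adjusted over training by the scheduler's proxy for cross-modal alignment improvement.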

Myunsoo Kim, Seongwoong Shim, Byung-Jun Lee • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 75.62 | 706 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 75.78 | 486 |
| Natural Language Visual Reasoning | NLVR2 (test-p) | Accuracy | 82.28 | 346 |
| Natural Language Visual Reasoning | NLVR2 (dev) | Accuracy | 82.61 | 307 |
| Visual Question Answering | VQA (test-dev) | -- | -- | 147 |
| Visual Question Answering | VQA (test-std) | Accuracy | 71.36 | 120 |
| Image Retrieval | MS-COCO | Recall@5 | 83.6 | 69 |
| Text Retrieval | MS-COCO | Recall@1 | 78.7 | 30 |
