FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Alignment
About
False negatives pose a critical challenge in vision-language pretraining (VLP) due to the many-to-many correspondence between images and texts in large-scale datasets. These false negatives introduce conflicting supervision signals that degrade the learned embedding space and diminish the effectiveness of hard negative sampling. In this paper, we propose FALCON (False-negative Aware Learning of COntrastive Negatives), a learning-based mini-batch construction strategy that adaptively balances the trade-off between hard and false negatives during VLP. Rather than relying on fixed heuristics, FALCON employs a negative mining scheduler that dynamically selects negative samples of appropriate hardness for each anchor instance during mini-batch construction, guided by a proxy for cross-modal alignment improvement. Experimental results demonstrate that FALCON significantly improves performance across three vision-language learning frameworks (ALBEF, BLIP-2, SigLIP-2) and a broad range of downstream tasks and evaluation settings, underscoring its effectiveness and robustness in mitigating the impact of false negatives.
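The core idea — constructing mini-batches from negatives of controlled hardness while screening out probable false negatives — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the fixed `fn_threshold` cutoff for filtering likely false negatives, and the scalar `hardness` knob (which a scheduler such as FALCON's would adapt per anchor rather than fix) are all assumptions for the sake of the example.

```python
# Hypothetical sketch (not the FALCON code): hardness-aware negative
# selection for one anchor, excluding likely false negatives.
import numpy as np

def select_negatives(sim, anchor, k, fn_threshold=0.9, hardness=0.5):
    """Pick k negative indices for `anchor` from a similarity row `sim`.

    sim: (N,) cross-modal similarities of the anchor to all candidates.
    fn_threshold: candidates above this are treated as probable false
        negatives (texts/images that genuinely match) and excluded;
        a fixed cutoff stands in for FALCON's learned criterion.
    hardness: in [0, 1]; 1.0 takes only the hardest remaining negatives,
        0.0 samples uniformly. A scheduler would adapt this during training.
    """
    rng = np.random.default_rng(0)
    # Drop the anchor itself and probable false negatives.
    cand = np.flatnonzero((np.arange(sim.size) != anchor) & (sim < fn_threshold))
    # Rank remaining candidates by similarity, hardest first.
    ranked = cand[np.argsort(-sim[cand])]
    n_hard = int(round(hardness * k))
    hard = ranked[:n_hard]                 # top of the ranking
    easy = rng.choice(ranked[n_hard:], size=k - n_hard, replace=False)
    return np.concatenate([hard, easy])

# Toy similarity row: index 1 (0.92) looks like a false negative.
sim = np.array([0.95, 0.92, 0.7, 0.6, 0.5, 0.3, 0.1, 0.05])
negs = select_negatives(sim, anchor=0, k=4)
```

In this toy run, index 1 is never sampled despite being the hardest candidate, because its similarity exceeds the false-negative cutoff; the batch instead mixes the hardest surviving negatives (indices 2 and 3) with uniformly drawn easier ones.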
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy: 75.62 | 706 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy: 75.78 | 486 |
| Natural Language Visual Reasoning | NLVR2 (test-p) | Accuracy: 82.28 | 346 |
| Natural Language Visual Reasoning | NLVR2 (dev) | Accuracy: 82.61 | 307 |
| Visual Question Answering | VQA (test-dev) | -- | 147 |
| Visual Question Answering | VQA (test-std) | Accuracy: 71.36 | 120 |
| Image Retrieval | MS-COCO | R@5: 83.6 | 69 |
| Text Retrieval | MS-COCO | Recall@1: 78.7 | 30 |