Fine-Grained Semantically Aligned Vision-Language Pre-Training

About

Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks. Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts, or advanced cross-modal attention upon image and text features. However, they fail to explicitly learn the fine-grained semantic alignment between visual regions and textual phrases, as only global image-text alignment information is available. In this paper, we introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions. To efficiently compute the game-theoretic interactions, we further propose an uncertainty-aware neural Shapley interaction learning module. Experiments show that LOUPE achieves state-of-the-art performance on a variety of vision-language tasks. Furthermore, without any object-level human annotations and fine-tuning, LOUPE achieves competitive performance on object detection and visual grounding. More importantly, LOUPE opens a new promising direction of learning fine-grained semantics from large-scale raw image-text pairs. The repository of this work is at https://github.com/YYJMJC/LOUPE.
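As a rough illustration of the "game-theoretic interactions" the abstract refers to, the sketch below computes the standard pairwise Shapley interaction index by exact enumeration on a small toy game. It is not taken from the LOUPE repository and is not claimed to match the paper's exact formulation: the function names and the toy value function are hypothetical and chosen only for this example. Exact enumeration visits every subset of the remaining players, so its cost grows exponentially with the number of players, which is the kind of cost an efficient approximation module would need to avoid.

```python
from itertools import combinations
from math import factorial


def shapley_interaction(value, players, i, j):
    """Exact pairwise Shapley interaction index of players i and j.

    `value` maps a frozenset of players to a real-valued payoff.
    This is the textbook index, used here only as an illustration.
    """
    others = [p for p in players if p not in (i, j)]
    n = len(players)
    total = 0.0
    for size in range(len(others) + 1):
        # Weight of a context coalition with `size` members.
        weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Extra payoff of having i and j together, beyond each one alone.
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total


def toy_value(coalition):
    """Hypothetical game: additive payoffs plus a bonus when 0 and 1 cooperate."""
    solo = {0: 1.0, 1: 0.5, 2: 0.25}
    bonus = 2.0 if {0, 1} <= coalition else 0.0
    return sum(solo[p] for p in coalition) + bonus


print(shapley_interaction(toy_value, [0, 1, 2], 0, 1))  # players that cooperate
print(shapley_interaction(toy_value, [0, 1, 2], 0, 2))  # players that do not
```

In this toy game the index for players 0 and 1 recovers the cooperation bonus, while the index for players 0 and 2 is zero because their payoffs are purely additive. In a vision-language setting one could think of the players as image regions and textual phrases, but that mapping is only an analogy here.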

Juncheng Li, Xin He, Longhui Wei, Long Qian, Linchao Zhu, Lingxi Xie, Yueting Zhuang, Qi Tian, Siliang Tang • 2022

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Image Captioning | MS COCO Karpathy (test) | CIDEr 1.378 | 682 |
| Image Classification | Food-101 | -- | 494 |
| Image Classification | Stanford Cars | -- | 477 |
| Text-to-Image Retrieval | Flickr30K | R@1 76.3 | 460 |
| Image Classification | ImageNet | Top-1 Accuracy 85.7 | 429 |
| Image Classification | SUN397 | -- | 425 |
| Image Classification | Aircraft | Accuracy 80.2 | 302 |
| Visual Grounding | RefCOCO+ (val) | Accuracy 22.9 | 171 |
| Visual Grounding | RefCOCO+ (testB) | Accuracy 23.6 | 169 |
| Visual Grounding | RefCOCO+ (testA) | Accuracy 23.3 | 168 |

Showing 10 of 26 rows
