
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model

About

Multimodal semantic understanding often has to deal with uncertainty, meaning that the obtained messages tend to refer to multiple targets. Such uncertainty, both inter- and intra-modal, is problematic for interpretation. Little work has studied the modeling of this uncertainty, particularly in pre-training on unlabeled datasets and fine-tuning on task-specific downstream datasets. In this paper, we project the representations of all modalities as probabilistic distributions via a Probability Distribution Encoder (PDE) by utilizing sequence-level interactions. Compared to existing deterministic methods, such uncertainty modeling can convey richer multimodal semantic information and more complex relationships. Furthermore, we integrate uncertainty modeling with popular pre-training frameworks and propose suitable pre-training tasks: Distribution-based Vision-Language Contrastive learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM). The fine-tuned models are applied to challenging downstream tasks, including image-text retrieval, visual question answering, visual reasoning, and visual entailment, and achieve state-of-the-art results.

Yatai Ji, Junjie Wang, Yuan Gong, Lin Zhang, Yanru Zhu, Hongfa Wang, Jiaxing Zhang, Tetsuya Sakai, Yujiu Yang • 2022
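
To make the contrast with deterministic (point) embeddings concrete, here is a minimal sketch of comparing two representations that are distributions rather than vectors. The abstract does not say which distribution family or distance measure is used, so this sketch assumes diagonal Gaussians compared with the squared 2-Wasserstein distance; the names `DiagGaussian`, `wasserstein2_sq`, and `similarity` are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DiagGaussian:
    """Assumed representation: a Gaussian with diagonal covariance."""
    mean: List[float]
    std: List[float]  # per-dimension standard deviation (non-negative)


def wasserstein2_sq(p: DiagGaussian, q: DiagGaussian) -> float:
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    For diagonal covariances this has the closed form
    ||mean_p - mean_q||^2 + ||std_p - std_q||^2.
    """
    if len(p.mean) != len(q.mean) or len(p.std) != len(q.std):
        raise ValueError("distributions must have the same dimensionality")
    mean_term = sum((a - b) ** 2 for a, b in zip(p.mean, q.mean))
    std_term = sum((a - b) ** 2 for a, b in zip(p.std, q.std))
    return mean_term + std_term


def similarity(p: DiagGaussian, q: DiagGaussian) -> float:
    """Assumed similarity score: the negated distance, so closer means higher."""
    return -wasserstein2_sq(p, q)


# Two text representations share a mean with the image but differ in spread.
image = DiagGaussian(mean=[0.0, 0.0], std=[1.0, 1.0])
text_specific = DiagGaussian(mean=[0.0, 0.0], std=[1.0, 1.0])
text_ambiguous = DiagGaussian(mean=[0.0, 0.0], std=[3.0, 1.0])

d_specific = wasserstein2_sq(image, text_specific)
d_ambiguous = wasserstein2_sq(image, text_ambiguous)
```

A point embedding would treat `text_specific` and `text_ambiguous` as identical because their means coincide. Under the assumptions above, the distribution-based distance separates them through the spread term.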

Related benchmarks

Task                               Dataset              Metric            Result  Rank
Visual Question Answering          VQA 2.0 (test-dev)   Accuracy          78.03   337
Natural Language Visual Reasoning  NLVR2 (test-p)       Accuracy          83.48   327
Natural Language Visual Reasoning  NLVR2 (dev)          Accuracy          83.3    288
Text-to-Image Retrieval            MSCOCO 5K (test)     R@1               79.3    286
Visual Entailment                  SNLI-VE (test)       Overall Accuracy  81.39   197
Visual Entailment                  SNLI-VE (val)        Overall Accuracy  81.4    109
Image-Text Retrieval               Flickr30K 1K (test)  IR@1              83.8    10
