
Multi-Modal Representation Learning with Text-Driven Soft Masks

About

We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss, and data augmentation strategy. First, we generate diverse features for the image-text matching (ITM) task by soft-masking the image regions most relevant to a particular word in the corresponding caption, instead of removing them completely. Since our framework relies only on image-caption pairs with no fine-grained annotations, we identify the regions relevant to each word by computing word-conditional visual attention using the multi-modal encoder. Second, we encourage the model to focus on hard but diverse examples by proposing a focal loss for the image-text contrastive learning (ITC) objective, which alleviates the inherent overfitting and bias issues. Finally, we perform multi-modal data augmentation for self-supervised learning by mining diverse examples: masking words in the captions and applying distortions to the images. We show that the combination of these three techniques is effective for pretraining, leading to outstanding performance on multiple vision-language downstream tasks.
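The abstract's exact focal-ITC formulation is not given here, but the idea of down-weighting easy pairs in a contrastive objective can be sketched as follows. This is a minimal illustration, assuming matched image-caption pairs sit on the diagonal of the similarity matrix and a standard focal factor `(1 - p)^gamma`; the names `focal_itc_loss`, `temperature`, and `gamma` are illustrative, not the paper's.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def focal_itc_loss(image_emb, text_emb, temperature=0.07, gamma=2.0):
    """Image-to-text contrastive loss with a focal reweighting factor.

    Matched pairs are assumed to be at the same batch index (the diagonal).
    `gamma` shrinks the contribution of easy examples, so hard pairs
    dominate the gradient -- the intent behind a focal ITC objective.
    """
    # L2-normalize so the logits are scaled cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature

    probs = softmax(logits, axis=1)
    p_pos = np.diag(probs)  # probability assigned to the matched caption
    # Focal cross-entropy: pairs with p_pos near 1 contribute almost nothing.
    loss = -((1.0 - p_pos) ** gamma) * np.log(p_pos)
    return loss.mean()
```

With well-aligned embeddings the focal factor is near zero, so the loss concentrates on mismatched (hard) pairs rather than reinforcing examples the model already handles.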

Jaeyoo Park, Bohyung Han • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Natural Language Visual Reasoning | NLVR2 (test-p) | Accuracy | 81.6 | 327 |
| Natural Language Visual Reasoning | NLVR2 (dev) | Accuracy | 80.6 | 288 |
| Visual Entailment | SNLI-VE (test) | Overall Accuracy | 80.6 | 197 |
| Visual Entailment | SNLI-VE (val) | Overall Accuracy | 80.9 | 109 |
| Text-to-Image Retrieval | Flickr30k (1K) | R@1 | 80.1 | 48 |
| Image-to-Text Retrieval | MS COCO 5K | R@1 | 0.723 | 46 |
| Text-to-Image Retrieval | MS COCO 5K | R@1 | 54.1 | 39 |
| Image-to-Text Retrieval | Flickr30k (1K) | R@1 | 93.4 | 30 |
| Text Retrieval | Flickr30k (1K) | R@1 | 95.4 | 30 |
| Text Retrieval | MSCOCO (5K) | R@1 | 76.6 | 29 |

Showing 10 of 14 rows.
