Multi-Modal Representation Learning with Text-Driven Soft Masks
About
We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss, and data augmentation strategy. First, we generate diverse features for the image-text matching (ITM) task by soft-masking the regions in an image that are most relevant to a certain word in the corresponding caption, instead of removing them completely. Since our framework relies only on image-caption pairs with no fine-grained annotations, we identify the regions relevant to each word by computing the word-conditional visual attention using the multi-modal encoder. Second, we encourage the model to focus more on hard but diverse examples by proposing a focal loss for the image-text contrastive (ITC) objective, which alleviates overfitting and bias issues. Last, we perform multi-modal data augmentation for self-supervised learning by mining diverse examples, masking words in texts and rendering distortions on images. We show that the combination of these three innovations is effective for learning a pretrained model, leading to outstanding performance on multiple vision-language downstream tasks.
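The two core ideas above can be illustrated with a minimal sketch: regions attended by a caption word are down-weighted rather than zeroed out (soft masking), and the contrastive objective is reweighted by a focal term so that hard pairs dominate the loss. This is an illustrative NumPy sketch, not the paper's implementation; the scaling `1 - attn`, the temperature, and the focal exponent `gamma` are assumptions chosen for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_mask(region_feats, word_attn):
    """Down-weight (rather than remove) the regions a caption word attends to.

    region_feats: (R, D) image-region features
    word_attn:    (R,) word-conditional attention over the R regions
    Scaling by (1 - attn) is an assumed form of the soft mask.
    """
    return region_feats * (1.0 - word_attn)[:, None]

def focal_itc_loss(img_emb, txt_emb, temperature=0.07, gamma=2.0):
    """Image-text contrastive loss with a focal weight (1 - p)^gamma.

    Well-matched pairs (p near 1) are down-weighted, so hard examples
    contribute more. gamma and temperature are illustrative values.
    """
    sim = img_emb @ txt_emb.T / temperature   # (B, B) pairwise similarities
    p = softmax(sim, axis=1)                  # image-to-text match probabilities
    pos = np.diag(p)                          # probability of the true pair
    return np.mean(-((1.0 - pos) ** gamma) * np.log(pos + 1e-12))
```

Because the focal weight `(1 - p)^gamma` is at most 1, this loss never exceeds the plain contrastive cross-entropy on the same batch; the gap grows as pairs become easy.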
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Natural Language Visual Reasoning | NLVR2 (test-p) | Accuracy: 81.6 | 327 |
| Natural Language Visual Reasoning | NLVR2 (dev) | Accuracy: 80.6 | 288 |
| Visual Entailment | SNLI-VE (test) | Overall Accuracy: 80.6 | 197 |
| Visual Entailment | SNLI-VE (val) | Overall Accuracy: 80.9 | 109 |
| Text-to-Image Retrieval | Flickr30k (1K) | R@1: 80.1 | 48 |
| Image-to-Text Retrieval | MS COCO (5K) | R@1: 0.723 | 46 |
| Text-to-Image Retrieval | MS COCO (5K) | R@1: 54.1 | 39 |
| Image-to-Text Retrieval | Flickr30k (1K) | R@1: 93.4 | 30 |
| Text Retrieval | Flickr30k (1K) | R@1: 95.4 | 30 |
| Text Retrieval | MS COCO (5K) | R@1: 76.6 | 29 |