UFO: A UniFied TransfOrmer for Vision-Language Representation Learning

About

In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question), for vision-language (VL) representation learning. Existing approaches typically design an individual network for each modality and/or a specific fusion network for multimodal tasks. To simplify the network architecture, we use a single transformer network and enforce multi-task learning during VL pre-training, which includes the image-text contrastive loss, the image-text matching loss, and the masked language modeling loss based on the bidirectional and the seq2seq attention masks. The same transformer network is used as the image encoder, the text encoder, or the fusion network in different pre-training tasks. Empirically, we observe less conflict among different tasks and achieve new state-of-the-art results on visual question answering, COCO image captioning (cross-entropy optimization), and nocaps (in SPICE). On other downstream tasks, e.g., image-text retrieval, we also achieve competitive performance.
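
The abstract contrasts a bidirectional attention mask with a seq2seq attention mask for masked language modeling. The following is a minimal sketch of how those two masks are commonly built, not the authors' implementation. It assumes the usual prefix-LM convention in which image tokens form a fully visible prefix and text tokens attend causally; the function names `bidirectional_mask` and `seq2seq_mask` are hypothetical.

```python
def bidirectional_mask(num_tokens):
    """Every token may attend to every other token (1 = allowed)."""
    return [[1] * num_tokens for _ in range(num_tokens)]


def seq2seq_mask(num_prefix, num_target):
    """Assumed prefix-LM style mask.

    Prefix tokens (e.g. image regions) attend only within the prefix.
    Target tokens (e.g. caption words) attend to the whole prefix and
    to target positions up to and including their own.
    """
    n = num_prefix + num_target
    mask = [[0] * n for _ in range(n)]
    for q in range(n):          # query position
        for k in range(n):      # key position
            if k < num_prefix:
                mask[q][k] = 1  # the prefix is visible to every query
            elif q >= num_prefix and k <= q:
                mask[q][k] = 1  # causal attention inside the target
    return mask


if __name__ == "__main__":
    for row in seq2seq_mask(num_prefix=2, num_target=2):
        print(row)
```

With the bidirectional mask the same transformer can act as an encoder or fusion network, while the seq2seq mask keeps future caption tokens hidden, which is what generation tasks such as captioning need.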

Jianfeng Wang, Xiaowei Hu, Zhe Gan, Zhengyuan Yang, Xiyang Dai, Zicheng Liu, Yumao Lu, Lijuan Wang • 2021

Related benchmarks

Task                       Dataset                    Result                   Rank
Image Captioning           MS COCO Karpathy (test)    CIDEr 122.8              682
Visual Question Answering  VQA v2 (test-dev)          Overall Accuracy 76.64   664
Visual Question Answering  VQA v2 (test-std)          --                       466
Text-to-Image Retrieval    COCO                       Recall@1 59.2            130
Image Captioning           nocaps (val)               CIDEr (Overall) 94.3     93
Image Captioning           COCO (Karpathy split)      CIDEr 131.2              74
Image Captioning           COCO                       CIDEr 131.2              31
Visual Question Answering  VQA v2 (std)               Accuracy 76.76           31
Visual Question Answering  VQAv2 (test-std)           Accuracy 76.76           30
Visual Question Answering  VQA v2 (dev)               Accuracy 76.64           30
Showing 10 of 13 rows
