
Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models

About

A method for creating a vision-and-language (V&L) model is to extend a language model through structural modifications and V&L pre-training. Such an extension aims to make the V&L model inherit the natural language understanding (NLU) capability of the original language model. To see how well this is achieved, we propose to evaluate V&L models on an NLU benchmark (GLUE). We compare five V&L models, including single-stream and dual-stream models, trained with the same pre-training. Dual-stream models, with their higher modality independence achieved by approximately doubling the number of parameters, are expected to preserve the NLU capability better. Our main finding is that, contrary to expectation, the dual-stream scores are not much different from the single-stream scores. Further analysis shows that pre-training causes a performance drop on NLU tasks with a few exceptions. These results suggest that adopting a single-stream structure and devising the pre-training could be an effective way to preserve language knowledge in V&L extensions.
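The claim that dual-stream models roughly double the parameter count can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only (it counts Transformer encoder layers with a BERT-base-like configuration, ignoring embeddings, biases, and cross-modal attention layers, which real dual-stream models such as LXMERT or ViLBERT also include):

```python
def encoder_layer_params(d_model: int) -> int:
    """Approximate parameters in one Transformer encoder layer:
    4*d^2 for self-attention (Q, K, V, output projections) and
    8*d^2 for the feed-forward block (d -> 4d -> d); biases ignored."""
    attention = 4 * d_model * d_model
    feed_forward = 2 * d_model * (4 * d_model)
    return attention + feed_forward


def single_stream_params(d_model: int, num_layers: int) -> int:
    """Single-stream: one shared stack processes the concatenated
    text and image token sequence."""
    return num_layers * encoder_layer_params(d_model)


def dual_stream_params(d_model: int, num_layers: int) -> int:
    """Dual-stream: separate text and vision stacks of the same depth
    (cross-modal layers omitted here for simplicity), hence roughly
    twice the parameters of the single-stream variant."""
    return 2 * num_layers * encoder_layer_params(d_model)


if __name__ == "__main__":
    d, layers = 768, 12  # BERT-base-like configuration
    print(single_stream_params(d, layers))  # ~85M encoder parameters
    print(dual_stream_params(d, layers))    # ~170M encoder parameters
```

Under this simplification the dual-stream encoder is exactly twice the size of the single-stream one, which is the intuition behind the paper's "higher modality independence" framing.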

Taichi Iki, Akiko Aizawa • 2021

Related benchmarks

Task                              | Dataset            | Result                 | Rank
Visual Question Answering         | VQA v2 (test-dev)  | Overall Accuracy 73.82 | 664
Natural Language Understanding    | GLUE (dev)         | --                     | 504
Visual Question Answering         | VQA v2 (test-std)  | Accuracy 74.02         | 466
Natural Language Visual Reasoning | NLVR2 (test-p)     | Accuracy 79.98         | 327
Natural Language Visual Reasoning | NLVR2 (dev)        | Accuracy 79.12         | 288
Visual Entailment                 | SNLI-VE (test)     | Overall Accuracy 79.38 | 197
Image Captioning                  | NoCaps             | CIDEr 80.9             | 101
Visual Entailment                 | SNLI-VE (dev)      | Accuracy 79.39         | 70
Image Captioning                  | COCO Caption       | CIDEr 140              | 55
Visual Question Answering         | VQA Karpathy (test)| Overall Accuracy 70.5  | 21
