Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models
About
A method for creating a vision-and-language (V&L) model is to extend a language model through structural modifications and V&L pre-training. Such an extension aims to make the V&L model inherit the natural language understanding (NLU) capability of the original language model. To see how well this is achieved, we propose evaluating V&L models on an NLU benchmark (GLUE). We compare five V&L models, including single-stream and dual-stream models, trained with the same pre-training. Dual-stream models, whose higher modality independence is achieved by approximately doubling the number of parameters, are expected to preserve the NLU capability better. Our main finding is that, contrary to expectation, the dual-stream scores are not much different from the single-stream scores. Further analysis shows that pre-training causes a performance drop in NLU tasks, with few exceptions. These results suggest that adopting a single-stream structure and devising a suitable pre-training scheme could be an effective way to better maintain language knowledge in V&L extensions.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 73.82 | 664 |
| Natural Language Understanding | GLUE (dev) | -- | -- | 504 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 74.02 | 466 |
| Natural Language Visual Reasoning | NLVR2 (test-p) | Accuracy | 79.98 | 327 |
| Natural Language Visual Reasoning | NLVR2 (dev) | Accuracy | 79.12 | 288 |
| Visual Entailment | SNLI-VE (test) | Overall Accuracy | 79.38 | 197 |
| Image Captioning | NoCaps | CIDEr | 80.9 | 101 |
| Visual Entailment | SNLI-VE (dev) | Accuracy | 79.39 | 70 |
| Image Captioning | COCO Caption | CIDEr | 140 | 55 |
| Visual Question Answering | VQA Karpathy (test) | Overall Accuracy | 70.5 | 21 |