Going Beyond Nouns With Vision & Language Models Using Synthetic Data
About
Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications, enabling the replacement of a fixed set of supported classes with zero-shot, open-vocabulary reasoning over (almost arbitrary) natural language prompts. However, recent works have uncovered a fundamental weakness of these models: their difficulty in understanding Visual Language Concepts (VLC) that go 'beyond nouns', such as the meaning of non-object words (e.g., attributes, actions, relations, states, etc.), and in performing compositional reasoning, such as understanding the significance of word order in a sentence. In this work, we investigate to what extent purely synthetic data can be leveraged to teach these models to overcome such shortcomings without compromising their zero-shot capabilities. We contribute Synthetic Visual Concepts (SyViC) - a million-scale synthetic dataset and data-generation codebase that allows generating additional suitable data to improve the VLC understanding and compositional reasoning of VL models. Additionally, we propose a general VL finetuning strategy for effectively leveraging SyViC towards achieving these improvements. Our extensive experiments and ablations on the VL-Checklist, Winoground, and ARO benchmarks demonstrate that it is possible to adapt strong pre-trained VL models with synthetic data, significantly enhancing their VLC understanding (e.g., by 9.9% on ARO and 4.3% on VL-Checklist) with under a 1% drop in their zero-shot accuracy.
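The zero-shot, open-vocabulary inference described above can be sketched in a few lines: an image embedding is ranked against the embeddings of arbitrary text prompts by cosine similarity, CLIP-style. This is a minimal illustration with toy vectors standing in for the outputs of a pretrained image and text encoder; `zero_shot_scores` is a hypothetical helper, not code from the SyViC repository.

```python
import numpy as np

def zero_shot_scores(image_emb: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Score open-vocabulary prompts against one image, CLIP-style.

    image_emb: (d,) embedding from an image encoder.
    text_embs: (n, d) embeddings of n natural-language prompts.
    Returns a softmax distribution over the n prompts.
    """
    # L2-normalize so the dot product is cosine similarity
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img
    # numerically stable softmax over prompts
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Toy example: the image embedding aligns with the first prompt
probs = zero_shot_scores(np.array([1.0, 0.0]),
                         np.array([[1.0, 0.0], [0.0, 1.0]]))
```

Because the class set is just a list of prompts, adding or removing classes requires no retraining; the weakness the paper targets is that such similarity scores are largely insensitive to non-noun words and word order in the prompts.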
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Compositional Vision-Language Reasoning | Winoground | Text Score | 30 | 47 |
| Compositional Reasoning | VL-Checklist | Attribute Score | 75.34 | 37 |
| Multimodal Compositional Understanding | ARO | Relational Score | 71.4 | 27 |
| Image-Text Matching | Winoground | Text Agreement Score | 43.25 | 26 |
| Compositional Reasoning | ARO | Relation Score | 71.4 | 17 |
| Vision-Language Probing | VL-CheckList (test) | Object: Avg | 69.4 | 17 |
| Vision-Language Compositional Reasoning | ARO | Accuracy | 0.692 | 14 |
| Zero-shot Classification | ELEVATER ImageNet 21 datasets | Average Accuracy | 55.3 | 7 |
| Image-Text Matching | Winoground clean | Text Agreement Score | 52.63 | 4 |
| Vision-Language Compositional Reasoning | ARO (test) | VG-Rel | 71.4 | 4 |