Going Beyond Nouns With Vision & Language Models Using Synthetic Data
About
Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications, enabling the replacement of a fixed set of supported classes with zero-shot, open-vocabulary reasoning over (almost arbitrary) natural language prompts. However, recent works have uncovered a fundamental weakness of these models: their difficulty in understanding Visual Language Concepts (VLC) that go 'beyond nouns', such as the meaning of non-object words (e.g., attributes, actions, relations, states, etc.), and in performing compositional reasoning, such as understanding the significance of word order in a sentence. In this work, we investigate to what extent purely synthetic data can be leveraged to teach these models to overcome such shortcomings without compromising their zero-shot capabilities. We contribute Synthetic Visual Concepts (SyViC) - a million-scale synthetic dataset and data-generation codebase that allows generating additional suitable data to improve the VLC understanding and compositional reasoning of VL models. Additionally, we propose a general VL finetuning strategy for effectively leveraging SyViC towards achieving these improvements. Our extensive experiments and ablations on the VL-Checklist, Winoground, and ARO benchmarks demonstrate that it is possible to adapt strong pre-trained VL models with synthetic data, significantly enhancing their VLC understanding (e.g., by 9.9% on ARO and 4.3% on VL-Checklist) with under a 1% drop in their zero-shot accuracy.
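The zero-shot, open-vocabulary inference described above can be sketched in a few lines: an image embedding is ranked against the embeddings of arbitrary text prompts by cosine similarity, CLIP-style. This is a minimal illustration with toy vectors standing in for the outputs of a pretrained image and text encoder; `zero_shot_scores` is a hypothetical helper, not code from the SyViC repository.

```python
import numpy as np

def zero_shot_scores(image_emb: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Score open-vocabulary prompts against one image, CLIP-style.

    image_emb: (d,) embedding from an image encoder.
    text_embs: (n, d) embeddings of n natural-language prompts.
    Returns a softmax distribution over the n prompts.
    """
    # L2-normalize so the dot product is cosine similarity
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img
    # numerically stable softmax over prompts
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Toy example: the image embedding aligns with the first prompt
probs = zero_shot_scores(np.array([1.0, 0.0]),
                         np.array([[1.0, 0.0], [0.0, 1.0]]))
```

Because the class set is just a list of prompts, adding or removing classes requires no retraining; the weakness the paper targets is that such similarity scores are largely insensitive to non-noun words and word order in the prompts.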
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Compositional Vision-Language Reasoning | Winoground | Text Score | 30 | 47 |
| Compositional Reasoning | VL-Checklist | Attribute Score | 75.34 | 37 |
| Multimodal Compositional Understanding | ARO | Relational Score | 71.4 | 27 |
| Image-Text Matching | Winoground | Text Agreement Score | 43.25 | 26 |
| Compositional Reasoning | ARO | Relation Score | 71.4 | 17 |
| Vision-Language Probing | VL-CheckList (test) | Object: Avg | 69.4 | 17 |
| Vision-Language Compositional Reasoning | ARO | Accuracy | 0.692 | 14 |
| Zero-shot Classification | ELEVATER ImageNet 21 datasets | Average Accuracy | 55.3 | 7 |
| Image-Text Matching | Winoground clean | Text Agreement Score | 52.63 | 4 |
| Vision-Language Compositional Reasoning | ARO (test) | VG-Rel | 71.4 | 4 |