VladVA: Discriminative Fine-tuning of LVLMs

About

Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown to be capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include (1) a carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses. This is accompanied by ablation studies that justify the necessity of our framework's components; (2) a parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters; (3) significant improvements over state-of-the-art CLIP-like models of similar size, including standard image-text retrieval benchmarks and notable gains in compositionality.

Yassine Ouali, Adrian Bulat, Alexandros Xenos, Anestis Zaganidis, Ioannis Maniadis Metaxas, Brais Martinez, Georgios Tzimiropoulos • 2024
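
The abstract states that the model is trained with both contrastive and next-token prediction losses, but gives no formula. The following is a minimal sketch of one way such a combined objective could look, not the authors' implementation. It assumes a symmetric InfoNCE contrastive term and a plain weighted sum. The function names, the temperature value, and the `ntp_weight` parameter are all hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over an N x N image-text similarity matrix.

    sim[i][j] is the similarity of image i and text j; matching pairs
    sit on the diagonal. The temperature default is an assumption.
    """
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    transposed = [list(col) for col in zip(*scaled)]
    # Image-to-text: each image should pick its own caption.
    i2t = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    # Text-to-image: each caption should pick its own image.
    t2i = -sum(log_softmax(transposed[i])[i] for i in range(n)) / n
    return 0.5 * (i2t + t2i)


def next_token_loss(token_logits, target_ids):
    """Mean cross-entropy of each target token under its logits."""
    losses = [-log_softmax(logits)[t]
              for logits, t in zip(token_logits, target_ids)]
    return sum(losses) / len(losses)


def combined_loss(sim, token_logits, target_ids, ntp_weight=1.0):
    """Weighted sum of both terms; ntp_weight is a hypothetical knob."""
    return (contrastive_loss(sim)
            + ntp_weight * next_token_loss(token_logits, target_ids))
```

In this sketch the contrastive term drives image-text discrimination, while the next-token term keeps the language-modeling signal active during fine-tuning. How the two terms are actually balanced in VladVA is not stated on this page.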

Related benchmarks

Task	Dataset	Metric	Result	
Image Retrieval	Flickr30K	R@1	85	164
Text Retrieval	Flickr30K	R@1	94.3	120
Compositional Vision-Language Reasoning	Winoground	Text Score	40.5	61
Text Retrieval	COCO	R@1	72.9	59
Image Retrieval	COCO	R@1	59	53
Zero-shot Image Classification	ImageNet zero-shot	Top-1 Accuracy	70.6	35
Classification	Oxford Pets zero-shot	Accuracy (Zero-Shot)	90.1	26
Language Compositionality	SugarCrepe (test)	Replace: Object (R@1)	98.1	21
Image-Text Compositionality Evaluation	SugarCrepe ++ (test)	--	--	21
Text-to-Image Retrieval	NoCaps	Recall@1	72.3	17

Showing 10 of 22 rows
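
Several rows above report R@1 (Recall@1) for retrieval. As an illustration of how that metric is commonly computed from a query-by-candidate score matrix, here is a small sketch. It assumes exactly one correct candidate per query, stored at the same index as the query. That is a simplification, since datasets such as Flickr30K and COCO pair each image with several captions. The function name and the toy scores are hypothetical.

```python
def recall_at_k(sim, k=1):
    """Fraction of queries whose correct candidate ranks in the top k.

    sim[i][j] is the score of query i against candidate j; the correct
    candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy 3 x 3 score matrix: query 1 ranks the wrong candidate first.
scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(scores, k=1))
print(recall_at_k(scores, k=2))
```

On the toy matrix, two of the three queries rank their match first, so R@1 is 2/3. All three have the match within the top two, so R@2 is 1.0.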
