VladVA: Discriminative Fine-tuning of LVLMs

About

Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown to be capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include (1) a carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses. This is accompanied by ablation studies that justify the necessity of our framework's components; (2) a parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters; (3) significant improvements over state-of-the-art CLIP-like models of similar size, including standard image-text retrieval benchmarks and notable gains in compositionality.
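The abstract describes training with both a contrastive loss and a next-token prediction loss. The following is a minimal sketch of how two such terms could be combined, not the authors' implementation. It assumes a symmetric InfoNCE formulation for the contrastive term; the temperature of 0.07, the `ntp_weight` parameter, and all function names are illustrative assumptions that do not come from the paper.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over an N x N image-text similarity matrix.

    sim[i][j] is the similarity of image i and text j; matched pairs
    sit on the diagonal. The temperature value is an assumption.
    """
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    # Image -> text: each row should assign most mass to its diagonal entry.
    i2t = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    # Text -> image: the same computation over columns.
    cols = [[scaled[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


def next_token_loss(logits, targets):
    """Mean cross-entropy of target token ids under per-position logits."""
    total = sum(log_softmax(row)[t] for row, t in zip(logits, targets))
    return -total / len(targets)


def combined_loss(sim, logits, targets, ntp_weight=1.0):
    """Sum of both terms; ntp_weight is a hypothetical weighting knob."""
    return contrastive_loss(sim) + ntp_weight * next_token_loss(logits, targets)
```

In this sketch, a similarity matrix whose diagonal dominates yields a small contrastive term, while the next-token term depends only on the caption logits, so the two objectives can be weighted independently.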

Yassine Ouali, Adrian Bulat, Alexandros Xenos, Anestis Zaganidis, Ioannis Maniadis Metaxas, Brais Martinez, Georgios Tzimiropoulos • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Image Retrieval | Flickr30K | R@1 | 85 | 144 |
| Text Retrieval | Flickr30K | R@1 | 94.3 | 75 |
| Compositional Vision-Language Reasoning | Winoground | Text Score | 40.5 | 47 |
| Zero-shot Image Classification | ImageNet zero-shot | Top-1 Accuracy | 70.6 | 35 |
| Text Retrieval | COCO | R@1 | 72.9 | 28 |
| Image Retrieval | COCO | R@1 | 59 | 22 |
| Language Compositionality | SugarCrepe (test) | Replace: Object (R@1) | 98.1 | 21 |
| Image-Text Compositionality Evaluation | SugarCrepe ++ (test) | Swap Object ITT | 56.1 | 17 |
| Text-to-Image Retrieval | NoCaps | Recall@1 | 72.3 | 17 |
| Image-to-Text Retrieval | NoCaps | R@1 | 85.7 | 17 |