CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning

About

The rapid advancement of large vision-language models (LVLMs) has driven significant progress in multimodal tasks, enabling models to interpret, reason, and generate outputs across both visual and textual domains. While excelling in generative tasks, existing LVLMs often face limitations in tasks requiring high-fidelity representation learning, such as generating image or text embeddings for retrieval. Recent work has proposed finetuning LVLMs for representational learning, but the fine-tuned model often loses its generative capabilities due to the representational learning training paradigm. To address this trade-off, we introduce CAFe, a contrastive-autoregressive fine-tuning framework that enhances LVLMs for both representation and generative tasks. By integrating a contrastive objective with autoregressive language modeling, our approach unifies these traditionally separate tasks, achieving state-of-the-art results in both multimodal retrieval and multimodal generative benchmarks, including object hallucination (OH) mitigation. CAFe establishes a novel framework that synergizes embedding and generative functionalities in a single model, setting a foundation for future multimodal models that excel in both retrieval precision and coherent output generation.
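As a rough illustration of what "integrating a contrastive objective with autoregressive language modeling" could look like, the sketch below adds an InfoNCE-style contrastive loss to a next-token cross-entropy loss. The abstract does not give the formulation, so the weighted-sum form, the weight `lam`, the `temperature` value, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import math

def log_softmax(scores):
    # Numerically stable log-softmax over a list of floats.
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def contrastive_loss(queries, targets, temperature=0.05):
    # InfoNCE over a batch: queries[i] is paired with targets[i];
    # every other target in the batch acts as a negative.
    q = [normalize(v) for v in queries]
    t = [normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        sims = [dot(qi, tj) / temperature for tj in t]
        total -= log_softmax(sims)[i]
    return total / len(q)

def autoregressive_loss(logits, token_ids):
    # Mean next-token cross-entropy: logits[k] scores the vocabulary
    # for the position whose ground-truth token is token_ids[k].
    total = 0.0
    for step_logits, tok in zip(logits, token_ids):
        total -= log_softmax(step_logits)[tok]
    return total / len(token_ids)

def combined_loss(queries, targets, logits, token_ids, lam=1.0, temperature=0.05):
    # Hypothetical joint objective: contrastive term plus weighted
    # language-modeling term (the weighting scheme is an assumption).
    return (contrastive_loss(queries, targets, temperature)
            + lam * autoregressive_loss(logits, token_ids))
```

In a setup like this, the contrastive term shapes the embeddings used for retrieval while the language-modeling term keeps the generative head trained, which is one plausible way to avoid the loss of generative ability the abstract describes.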

Hao Yu, Zhuokai Zhao, Shen Yan, Lukasz Korycki, Jianyu Wang, Baosheng He, Jiayi Liu, Lizhu Zhang, Xiangjun Fan, Hanchao Yu • 2025

Related benchmarks

Task	Dataset	Result
Multimodal Retrieval	MMEB	Classification Score 65.2	94
Multimodal Embedding	MMEB	Classification Accuracy 65.2	79
Image Embedding	MMEB v1 (test)	Classification 65.2	70
Multimodal Retrieval	MMEB Image V2	Overall Score 67.6	37
Multimodal Retrieval and Understanding	MMEB V2 (test)	Image CLS Acc 65.2	37
Multimodal Embedding Evaluation	MMEB V2 (test)	Image CLS Hit@1 63.6	35
Multimodal Visual Document Retrieval	MMEB Visual Document portion v2	ViDoRe V1 Score 70.7	31
Universal Multimodal Retrieval	MMEB Full v2 (test)	Overall Average Score 60.6	18
Retrieval	MMEB v2	Image Retrieval Score 67.6	18
Multimodal Retrieval	MMEB Image v2 (test)	CLS (Hit@1) 63.6	18

Showing 10 of 16 rows
