Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization

About

The emergence of large Vision Language Models (VLMs) has broadened the scope and capabilities of single-modal Large Language Models (LLMs) by integrating visual modalities, thereby unlocking transformative cross-modal applications in a variety of real-world scenarios. Despite their impressive performance, VLMs are prone to significant hallucinations, particularly in the form of cross-modal inconsistencies. Building on the success of Reinforcement Learning from Human Feedback (RLHF) in aligning LLMs, recent advancements have focused on applying direct preference optimization (DPO) on carefully curated datasets to mitigate these issues. Yet, such approaches typically introduce preference signals in a brute-force manner, neglecting the crucial role of visual information in the alignment process. In this paper, we introduce Re-Align, a novel alignment framework that leverages image retrieval to construct a dual-preference dataset, effectively incorporating both textual and visual preference signals. We further introduce rDPO, an extension of standard direct preference optimization that incorporates an additional visual preference objective during fine-tuning. Our experimental results demonstrate that Re-Align not only mitigates hallucinations more effectively than previous methods but also yields significant performance gains in general visual question-answering (VQA) tasks. Moreover, we show that Re-Align maintains robustness and scalability across a wide range of VLM sizes and architectures. This work represents a significant step forward in aligning multimodal LLMs, paving the way for more reliable and effective cross-modal applications. We release all the code at https://github.com/taco-group/Re-Align.
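
The abstract names rDPO but does not state its objective, so the following is only a minimal sketch of the general idea: a standard DPO term on a textual preference pair, plus a second DPO-style term on a visual preference pair. The exact form of the visual term, the `visual_weight` parameter, and all function names are assumptions made for illustration, not the authors' implementation. The sketch assumes the visual pair scores the same preferred response once conditioned on the original image and once on a retrieved image.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(pol_w, ref_w, pol_l, ref_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a response:
    pol_* under the policy being trained, ref_* under the frozen reference;
    *_w is the preferred side of the pair, *_l the dispreferred side.
    """
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return -log_sigmoid(beta * margin)


def rdpo_loss_sketch(text_pair, visual_pair, beta=0.1, visual_weight=1.0):
    """ASSUMED form of a dual-preference loss (hypothetical, not from the source).

    text_pair:   log-probs for (chosen response, rejected response), same image.
    visual_pair: log-probs for the same response given
                 (original image, retrieved image).
    Both tuples are ordered (pol_w, ref_w, pol_l, ref_l).
    """
    text_term = dpo_loss(*text_pair, beta=beta)
    visual_term = dpo_loss(*visual_pair, beta=beta)
    return text_term + visual_weight * visual_term


# Made-up log-probabilities, purely for illustration.
text_pair = (-12.0, -14.0, -15.0, -13.0)
visual_pair = (-12.0, -14.0, -18.0, -15.0)
print(rdpo_loss_sketch(text_pair, visual_pair))
```

Setting `visual_weight=0.0` reduces the sketch to plain DPO on the textual pair, which makes it easy to compare the two objectives on the same data.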

Shuo Xing, Peiran Li, Yuping Wang, Ruizheng Bai, Yueqi Wang, Chan-Wei Hu, Chengxuan Qian, Huaxiu Yao, Zhengzhong Tu • 2025

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 87.1	2019
Visual Question Answering	TextVQA	Accuracy 53.3	1453
Visual Question Answering	VQA v2	Accuracy 76.3	1429
Hallucination Evaluation	AMBER	CHAIR 6.1	222
Hallucination Evaluation	HallusionBench	Accuracy 47.62	153
Hallucination Evaluation	Object-HalBench	CHAIR Score (s) 38.4	78
Science Question Answering	ScienceQA	IMG Score 69	64
Vision-Language Understanding	MM-Vet	Total Score 33.5	43
Hallucination Evaluation	MOH	HR^D 49.9	21

