Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models

About

Multi-modal large language models (MLLMs) are expected to support multi-turn queries that interleave image and text modalities in production. However, current MLLMs trained with visual-question-answering (VQA) datasets can suffer from degradation, as VQA datasets lack the diversity and complexity of the original text instruction datasets with which the underlying language model was trained. To address this degradation, we first collect a lightweight, 5k-sample VQA preference dataset in which answers are annotated by Gemini on five quality metrics in a granular fashion, and we investigate standard Supervised Fine-tuning, rejection sampling, Direct Preference Optimization (DPO) and SteerLM algorithms. Our findings indicate that with DPO, we can surpass the instruction-following capabilities of the underlying language model, achieving a 6.73 score on MT-Bench, compared to Vicuna's 6.57 and LLaVA's 5.99. This enhancement in textual instruction-following capability correlates with boosted visual instruction performance (+4.9% on MM-Vet, +6% on LLaVA-Bench), with minimal alignment tax on visual knowledge benchmarks compared to the previous RLHF approach. In conclusion, we propose a distillation-based multi-modal alignment model with fine-grained annotations on a small dataset that restores and boosts the MLLM's language capability after visual instruction tuning.
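For readers unfamiliar with DPO, the per-pair objective it optimizes can be sketched in a few lines. This is a minimal illustration of the standard DPO loss, not the authors' implementation. The function name, the argument names, and the default `beta=0.1` are assumptions made for the example. Each input is assumed to be the summed log-probability of a preferred ("chosen") or dispreferred ("rejected") answer under either the model being trained or a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-pair DPO loss (hypothetical helper, illustrative only).

    loss = -log sigmoid(beta * ((policy margin) - (reference margin)))
    where each margin is logp(chosen) - logp(rejected).
    beta=0.1 is an assumed value, not one taken from the paper.
    """
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)

    # -log(sigmoid(x)) == log(1 + exp(-x)); branch to avoid overflow.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss falls below log 2 when the trained model favors the chosen answer over the rejected one more strongly than the reference model does, and rises above log 2 in the opposite case.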

Shengzhi Li, Rongyu Lin, Shichao Pei • 2024

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	--	2056
Instruction Following	AlpacaEval	Win Rate 86.4	423
Instruction Following	MT-Bench	MT-Bench Score 6.73	287
Hallucination Evaluation	POPE	Accuracy 83.7	281
Hallucination Evaluation	AMBER	CHAIR 6	267
Multimodal Reasoning	MMBench	--	180
Multimodal Understanding	LLaVA-Bench	Overall Score 64.6	94
Hallucination Evaluation	MMHal	Score 2.45	62
Visual Reasoning and Instruction Following	MM-Vet	Overall Score 41.2	23
Captioning Hallucination	ObjHal	CRs 19	21

Showing 10 of 15 rows
