
Silkie: Preference Distillation for Large Visual Language Models

About

This paper explores preference distillation for large vision-language models (LVLMs), improving their ability to generate helpful responses that are faithfully anchored to the visual context. We first build a vision-language feedback (VLFeedback) dataset using AI annotation. Specifically, responses are generated by models sampled from a pool of 12 LVLMs, conditioned on multi-modal instructions sourced from various datasets. We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations. The preference supervision is then distilled into Qwen-VL-Chat through direct preference optimization (DPO). The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark in perception and cognition capabilities, respectively. Silkie also demonstrates reduced hallucination, setting a new state-of-the-art score of 3.02 on the MMHal-Bench benchmark. Further analysis shows that DPO with our VLFeedback dataset mainly boosts the fine-grained perception and complex cognition abilities of LVLMs, leading to more comprehensive improvements than human-annotated preference datasets.
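To make the DPO step concrete, the sketch below shows the standard per-pair DPO objective the paper relies on: the policy is rewarded, relative to a frozen reference model, for assigning a larger log-probability margin to the GPT-4V-preferred response than to the rejected one. This is a minimal illustration of the general DPO loss with scalar sequence log-probabilities, not the authors' training code; the function name and `beta` default are assumptions.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Implicit rewards are the policy-vs-reference log-prob ratios, scaled
    by beta; the loss is -log(sigmoid(reward margin)), so it shrinks as
    the policy favors the chosen response more than the reference does.
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) written in the numerically stable softplus form
    return math.log1p(math.exp(-margin))
```

When the policy matches the reference, the margin is zero and the loss is log 2; widening the margin in favor of the chosen response drives the loss toward zero.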

Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VizWiz | Accuracy | 52.6 | 1043 |
| Visual Question Answering | GQA | -- | -- | 963 |
| Multimodal Understanding | MMBench | -- | -- | 367 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 49.9 | 281 |
| Hallucination Evaluation | MMHal-Bench | MMHal Score | 3.19 | 174 |
| Multimodal Understanding | MME | -- | -- | 158 |
| Hallucination Evaluation | POPE | -- | -- | 132 |
| Hallucination Evaluation | AMBER | F1 Score | 87.6 | 71 |
| Multimodal Understanding | SEED-Bench (overall) | Overall Score | 59.3 | 40 |
| Hallucination Evaluation | Object-HalBench | Mention Hallucination Rate | 13.4 | 39 |

(Showing 10 of 17 rows)
