LLaVA-Critic: Learning to Evaluate Multimodal Models

About

We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios. Our experiments demonstrate the model's effectiveness in two key areas: (1) LMM-as-a-Judge, where LLaVA-Critic provides reliable evaluation scores, performing on par with or surpassing GPT models on multiple evaluation benchmarks; and (2) Preference Learning, where it generates reward signals for preference learning, enhancing model alignment capabilities. This work underscores the potential of open-source LMMs in self-critique and evaluation, setting the stage for future research into scalable, superhuman alignment feedback mechanisms for LMMs.
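As a rough illustration of the preference-learning use case, the sketch below turns critic scores into a (chosen, rejected) pair of the kind a preference-optimization method would consume. It is a minimal sketch under stated assumptions, not the authors' pipeline: `PreferencePair`, `build_preference_pair`, `min_margin`, and `toy_critic` are hypothetical names, and `toy_critic` is a stand-in that scores by response length where a real setup would query the critic model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PreferencePair:
    """One training example for preference learning (hypothetical container)."""
    prompt: str
    chosen: str
    rejected: str


def build_preference_pair(
    prompt: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    min_margin: float = 0.0,
) -> Optional[PreferencePair]:
    """Score each candidate once, then pair the best with the worst.

    Returns None when there are fewer than two candidates or when the
    score gap does not exceed min_margin (no clear preference signal).
    """
    if len(candidates) < 2:
        return None
    scored = [(score_fn(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda t: t[0])
    worst_score, worst = min(scored, key=lambda t: t[0])
    if best_score - worst_score <= min_margin:
        return None
    return PreferencePair(prompt=prompt, chosen=best, rejected=worst)


def toy_critic(prompt: str, response: str) -> float:
    """Stand-in scorer (assumption): longer responses score higher.

    A real critic would be a model call that judges the response
    against the prompt and any image input.
    """
    return float(len(response))


pair = build_preference_pair(
    "Describe the image.",
    ["A cat.", "A gray cat sleeping on a sofa.", "A cat on a sofa."],
    toy_critic,
)
```

Filtering by `min_margin` is one design choice among several; it drops prompts where the critic sees little difference between candidates, so only pairs with a clear score gap are kept.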

Tianyi Xiong, Xiyao Wang, Dong Guo, Qinghao Ye, Haoqi Fan, Quanquan Gu, Heng Huang, Chunyuan Li • 2024

Related benchmarks

Task	Dataset	Metric	Result
Reward Modeling	RewardBench	Chat Score	96.9	216
Multimodal Reward Modeling	VL-RewardBench	Accuracy	41.2	102
Multimodal Reward Modeling	Multimodal RewardBench	Accuracy	63.5	50
Correction	VISCO full 1.0 (test)	Correction Gain	58.9	46
Multimodal Reward Modeling	RewardBench Multimodal	Safety Score	78	44
Reward Modeling	VLRewardBench (test)	General	54.6	39
Critique	VISCO 1.0 (test)	VISCore	42.6	26
Multimodal Evaluation Consistency	MLLM-as-a-Judge	CO Score	38.2	22
Multimodal Evaluation Consistency	MLLM-as-a-Judge, RichHF-18K, GenAI-Bench	Average Score	39.8	22
Multimodal Reward Modeling	RewardBench MM-RLHF	MCQ Score	66.67	20

Showing 10 of 40 rows
