LLaVA-Critic: Learning to Evaluate Multimodal Models
About
We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios. Our experiments demonstrate the model's effectiveness in two key areas: (1) LMM-as-a-Judge, where LLaVA-Critic provides reliable evaluation scores, performing on par with or surpassing GPT models on multiple evaluation benchmarks; and (2) Preference Learning, where it generates reward signals that enhance model alignment capabilities. This work underscores the potential of open-source LMMs in self-critique and evaluation, setting the stage for future research into scalable, superhuman alignment feedback mechanisms for LMMs.
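As a rough illustration of the preference-learning use case, the sketch below shows one way critic scores could be turned into (chosen, rejected) pairs for a preference-optimization method such as DPO. It is not the authors' pipeline: the abstract does not say how reward signals are converted into training data, so the pointwise scoring setup, the `build_preference_pairs` helper, the `PreferencePair` type, and the `min_margin` threshold are all assumptions made for this example.

```python
from itertools import combinations
from typing import List, NamedTuple, Tuple


class PreferencePair(NamedTuple):
    """One training example for a preference-optimization method (hypothetical type)."""
    prompt: str
    chosen: str
    rejected: str


def build_preference_pairs(
    prompt: str,
    scored_responses: List[Tuple[str, float]],
    min_margin: float = 1.0,
) -> List[PreferencePair]:
    """Turn critic scores into (chosen, rejected) pairs.

    scored_responses holds (response_text, critic_score) tuples. The scores
    are assumed to come from a critic model; no model is called here.
    Pairs whose score gap is below min_margin are skipped as too ambiguous.
    """
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(scored_responses, 2):
        if abs(score_a - score_b) < min_margin:
            continue  # scores too close to count as a clear preference
        if score_a > score_b:
            pairs.append(PreferencePair(prompt, resp_a, resp_b))
        else:
            pairs.append(PreferencePair(prompt, resp_b, resp_a))
    return pairs


# Illustrative scores only; these are not results from the paper.
scored = [("answer A", 8.0), ("answer B", 5.0), ("answer C", 7.5)]
pairs = build_preference_pairs("Describe the image.", scored, min_margin=1.0)
```

With these made-up scores, A and C are too close to form a pair, so only A over B and C over B are kept. The margin filter is a design choice meant to keep noisy near-ties out of the training set.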
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Reward Modeling | RewardBench | Chat Score 96.9 | 216 |
| Multimodal Reward Modeling | VL-RewardBench | Accuracy 41.2 | 102 |
| Multimodal Reward Modeling | Multimodal RewardBench | Accuracy 63.5 | 50 |
| Correction | VISCO full 1.0 (test) | Correction Gain 58.9 | 46 |
| Multimodal Reward Modeling | RewardBench Multimodal | Safety Score 78 | 44 |
| Reward Modeling | VLRewardBench (test) | General 54.6 | 39 |
| Critique | VISCO 1.0 (test) | VISCore 42.6 | 26 |
| Multimodal Evaluation Consistency | MLLM-as-a-Judge | CO Score 38.2 | 22 |
| Multimodal Evaluation Consistency | MLLM-as-a-Judge, RichHF-18K, GenAI-Bench | Average Score 39.8 | 22 |
| Multimodal Reward Modeling | RewardBench MM-RLHF | MCQ Score 66.67 | 20 |