
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning

About

Recent reinforcement learning approaches, such as outcome-supervised GRPO, have advanced Chain-of-Thought reasoning in large language models (LLMs), yet their adaptation to multimodal LLMs (MLLMs) remains underexplored. To address the lack of rigorous evaluation for MLLM post-training methods, we introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. It offers a large training set and evaluates generalization across three escalating challenges: in-distribution, cross-environment, and cross-environment-task scenarios. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. This stems from reward signals that focus solely on final answers (encouraging shortcuts) and from strict KL penalties that limit exploration. To address this, we propose GRPO-CARE, a consistency-aware RL framework that optimizes both answer correctness and reasoning coherence without explicit supervision. GRPO-CARE introduces a two-tiered reward: (1) a base reward for answer correctness, and (2) an adaptive consistency bonus, computed by comparing the model's reasoning-to-answer likelihood (via a slowly-evolving reference model) against group peers. This dual mechanism amplifies rewards for reasoning paths that are both correct and logically consistent. Replacing KL penalties with this adaptive bonus, GRPO-CARE outperforms standard GRPO on SEED-Bench-R1, achieving a 6.7% performance gain on the hardest evaluation level and a 24.5% improvement in consistency. It also shows strong transferability, improving model performance across diverse video understanding benchmarks. Our work contributes a systematically designed benchmark and a generalizable post-training framework, advancing the development of more interpretable and robust MLLMs.
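
The two-tiered reward described above can be sketched in a few lines. This is a minimal illustration only, assuming a group of G sampled responses per question, a boolean correctness check, and a per-sample reasoning-to-answer log-likelihood scored by a slowly-updated (e.g. EMA) reference model; the function name, signature, and `bonus_scale` weight are all hypothetical and not the authors' released implementation.

```python
# Minimal sketch of GRPO-CARE's two-tiered reward, based on the abstract above.
# All names and hyperparameters here are illustrative assumptions.

import torch

def two_tier_rewards(
    answers_correct: torch.Tensor,   # (G,) bool: is each group member's final answer correct?
    consistency_ll: torch.Tensor,    # (G,) log-likelihood of each answer given its reasoning,
                                     # scored by a slowly-evolving (EMA) reference model
    bonus_scale: float = 0.5,        # assumed weight of the consistency bonus
) -> torch.Tensor:
    """Per-sample reward: base correctness plus an adaptive consistency bonus."""
    base = answers_correct.float()   # (1) base reward for answer correctness

    # (2) adaptive consistency bonus: a sample earns the bonus when its
    # reasoning-to-answer likelihood exceeds the group average, i.e. its
    # reasoning trace supports its answer more strongly than its peers' do.
    group_mean = consistency_ll.mean()
    bonus = bonus_scale * (consistency_ll > group_mean).float()

    # Gating the bonus on the base reward means rewards peak only for paths
    # that are both correct and logically consistent.
    return base + base * bonus

# Example: the first sample is correct AND more consistent than the group
# average, so it earns the bonus.
rewards = two_tier_rewards(
    answers_correct=torch.tensor([True, True, False, True]),
    consistency_ll=torch.tensor([-1.2, -3.5, -0.9, -2.0]),
)
# -> tensor([1.5, 1.0, 0.0, 1.0])
```

In this sketch, the group-relative comparison plays the role that the KL penalty plays in standard GRPO: rather than anchoring every sample to the reference policy, only samples whose reasoning supports their answer more strongly than the group average receive extra reward.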

Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Junhao Cheng, Ying Shan, Xihui Liu • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | - | 935 |
| Multimodal Evaluation | MME | Score: 2.52e+3 | 557 |
| Multimodal Evaluation | SEED-Bench | Accuracy: 76.36 | 80 |
| Multimodal Evaluation | MMStar | Accuracy: 64.13 | 46 |
| Vision Understanding | CVBench 2D | Accuracy: 74.91 | 22 |
| Visual Grounding | Lisa Grounding | Accuracy: 74.55 | 18 |
| Color Understanding | ColorBench | Accuracy: 35.39 | 18 |
| Multimodal Visual Pattern Understanding | MMVP | Accuracy: 80.33 | 16 |
| Multimodal Evaluation | MMT-Bench | Accuracy: 62.62 | 13 |
| Multimodal Understanding | SEED-Bench (cleaned) | Overall Score: 88.58 | 10 |

Showing 10 of 12 rows.
