SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward

About

Recent advances have shown success in eliciting strong reasoning abilities in multimodal large language models (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final outcome. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1, as an attempt to add reward signals for the thinking process in this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed based on the thinking reward comparison of responses leading to correct answers versus incorrect answers, helping to mitigate the impact of potentially unreliable thinking rewards. Moreover, we design an annealing training strategy that gradually reduces the thinking reward over time, allowing the model to rely more on the accurate rule-based outcome reward in later training stages. Experiments show that our SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks (e.g., MathVista, MMMU), demonstrating strong reasoning and generalization capabilities. Notably, our SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite the latter having 10 times more parameters. All code, models, and datasets are made publicly available at https://github.com/kxfan2002/SophiaVL-R1.
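
The abstract describes the trustworthiness weight and the annealing schedule only in words. The following is a minimal sketch of how the two could combine for one group of sampled responses. It is not the authors' implementation: the exponential forms, the function names, and the parameters `alpha` and `decay_steps` are all assumptions made for illustration, and the released code at the URL above is the authoritative source.

```python
import math
from statistics import mean


def trust_weight(thinking_rewards, is_correct):
    """Hypothetical trustworthiness weight for one group of responses.

    Compares the mean thinking reward of responses with correct answers
    against that of responses with incorrect answers.
    """
    correct = [r for r, ok in zip(thinking_rewards, is_correct) if ok]
    wrong = [r for r, ok in zip(thinking_rewards, is_correct) if not ok]
    # Assumption: with only one kind of response there is nothing to
    # compare, so the thinking reward is trusted fully.
    if not correct or not wrong:
        return 1.0
    gap = mean(correct) - mean(wrong)
    # Assumed form: full trust when correct responses score at least as
    # high as incorrect ones, exponentially reduced trust otherwise.
    return 1.0 if gap >= 0 else math.exp(gap)


def annealing_factor(step, decay_steps):
    """Assumed exponential decay of the thinking reward over training."""
    return math.exp(-step / decay_steps)


def combined_rewards(outcome_rewards, thinking_rewards, step,
                     alpha=0.3, decay_steps=100.0):
    """Outcome reward plus a trust-weighted, annealed thinking reward."""
    # Assumption: a positive rule-based outcome reward means "correct".
    is_correct = [r > 0 for r in outcome_rewards]
    gamma = trust_weight(thinking_rewards, is_correct)
    scale = gamma * alpha * annealing_factor(step, decay_steps)
    return [o + scale * t for o, t in zip(outcome_rewards, thinking_rewards)]
```

Under these assumptions, a group in which incorrect responses receive higher thinking rewards than correct ones gets a weight below 1, so a possibly hacked thinking reward contributes less, and the thinking term shrinks toward zero as `step` grows, leaving the rule-based outcome reward dominant late in training.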

Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, Xiangyu Yue • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Multimodal Understanding	MMBench	Accuracy	85.4	887
Visual Mathematical Reasoning	MathVista	Accuracy	71.3	448
Chart Question Answering	ChartQA	Accuracy	88.5	404
Multimodal Perception and Cognition	MME	Overall Score	2.40e+3	344
Multimodal Model Evaluation	MMBench	Accuracy	85.4	265
Multimodal Evaluation	MMStar	Accuracy	66.7	177
Multimodal Reasoning	MMMU-Pro	Accuracy	38.8	171
Multimodal Reasoning	MMMU (val)	Accuracy	56.7	168
Multimodal Mathematical Reasoning	MathVista mini	Accuracy	0.706	124
Multimodal Mathematical Reasoning	MathVerse	Average Score	48.8	70

Showing 10 of 19 rows
