OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

About

Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves a 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-sourced all our codes, pipeline, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.

Kaichen Zhang, Keming Wu, Zuhao Yang, Bo Li, Kairui Hu, Bin Wang, Ziwei Liu, Xingxuan Li, Lidong Bing• 2025

Related benchmarks

Task	Dataset	Result
Multimodal Reasoning	MMMU	Accuracy57.8	208
Multimodal Reasoning	MathVision	Accuracy43.6	162
Multimodal Reasoning	LogicVista	Accuracy50	147
Multimodal Reasoning	MathVerse	Accuracy63.8	130
Visual Question Answering	MMStar	Accuracy69	100
Mathematical Reasoning	MathVerse mini	Accuracy63.8	83
Multimodal Reasoning	MMStar	Accuracy70	78
Multi-modal Reasoning	EMMA	Accuracy24.5	57
Mathematical Reasoning	MathVision (test)	Accuracy43.6	53
Visual Question Answering	RealWorldQA (test)	Accuracy69.4	47

Showing 10 of 34 rows

Other info

GitHub

Follow for update

@wizwand_team Discord