Multimodal Reinforcement Learning with Adaptive Verifier for AI Agents

About

Agentic reasoning models trained with multimodal reinforcement learning (MMRL) have become increasingly capable, yet they are almost universally optimized with sparse, outcome-based rewards computed from the final answers. Richer rewards computed from the reasoning tokens can improve learning significantly by providing more fine-grained guidance. However, computing rewards more informative than outcome-based ones is challenging in MMRL, since different samples may require different scoring functions and teacher models may themselves provide noisy reward signals. In this paper, we introduce Argos (Agentic Reward for Grounded & Objective Scoring), a principled reward agent for training multimodal reasoning models on agentic tasks. For each sample, Argos selects from a pool of teacher-model-derived and rule-based scoring functions to simultaneously evaluate: (i) final response accuracy, (ii) spatiotemporal localization of referred entities and actions, and (iii) the quality of the reasoning process. We find that by leveraging our agentic verifier across both SFT data curation and RL training, our model achieves state-of-the-art results across multiple agentic tasks, including spatial reasoning, visual hallucination, and robotics and embodied AI benchmarks. Critically, we demonstrate that relying solely on SFT post-training on highly curated reasoning data is insufficient, as agents invariably collapse to ungrounded solutions during RL without our online verification. We also show that our agentic verifier can help reduce reward hacking in MMRL. Finally, we provide a theoretical justification for the effectiveness of Argos through the concept of Pareto optimality.

Reuben Tan, Baolin Peng, Zhengyuan Yang, Hao Cheng, Oier Mees, Theodore Zhao, Andrea Tupini, Isar Meijier, Qianhui Wu, Yuncong Yang, Lars Liden, Yu Gu, Sheng Zhang, Xiaodong Liu, Lijuan Wang, Marc Pollefeys, Yong Jae Lee, Jianfeng Gao • 2025
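
The per-sample scoring described in the abstract (choosing scoring functions from a pool, then evaluating answer accuracy, localization, and reasoning quality) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sample` fields, the selection rule, the equal weighting, and the `reasoning_stub` heuristic (a stand-in for a teacher-model judge) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical format


@dataclass
class Sample:
    # Hypothetical sample layout; the paper's actual data schema is not given here.
    answer: str
    reference: str
    reasoning: str = ""
    pred_box: Optional[Box] = None
    ref_box: Optional[Box] = None


def exact_match(s: Sample) -> float:
    """Rule-based accuracy: case-insensitive exact match of the final answer."""
    return float(s.answer.strip().lower() == s.reference.strip().lower())


def box_iou(s: Sample) -> float:
    """Rule-based localization: intersection-over-union of two boxes."""
    ax1, ay1, ax2, ay2 = s.pred_box
    bx1, by1, bx2, by2 = s.ref_box
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def reasoning_stub(s: Sample) -> float:
    """Placeholder for a teacher-model judge (assumption): 1.0 if the
    reasoning text mentions the final answer, else 0.0."""
    ans = s.answer.strip().lower()
    return float(bool(ans) and ans in s.reasoning.lower())


def select_scorers(s: Sample) -> Dict[str, Callable[[Sample], float]]:
    """Pick the scoring functions applicable to this sample (assumed rule:
    use a scorer only when the fields it needs are present)."""
    scorers: Dict[str, Callable[[Sample], float]] = {"accuracy": exact_match}
    if s.pred_box is not None and s.ref_box is not None:
        scorers["localization"] = box_iou
    if s.reasoning:
        scorers["reasoning"] = reasoning_stub
    return scorers


def score(s: Sample) -> Tuple[float, Dict[str, float]]:
    """Return the equally weighted mean of the selected scores and the parts."""
    parts = {name: fn(s) for name, fn in select_scorers(s).items()}
    return sum(parts.values()) / len(parts), parts


if __name__ == "__main__":
    sample = Sample(
        answer="Left",
        reference="left",
        reasoning="The cup is left of the plate.",
        pred_box=(0, 0, 2, 2),
        ref_box=(1, 1, 3, 3),
    )
    total, parts = score(sample)
    print(parts, round(total, 3))
```

In an RL loop, a scalar like `total` would stand in for the reward of one rollout; how Argos actually weights or combines its scoring functions is not specified on this page.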

Related benchmarks

Task	Dataset	Result
Spatial Reasoning	MindCube (tiny)	Accuracy 39.6	65
Embodied Task Completion	EB-Habitat	Avg Success Rate 20.7	63
Hallucination and Visual Reasoning Evaluation	HallusionBench	--	40
Robotic Manipulation	LIBERO Specialized Suites & Diverse Suite	Metric 90 Success Rate 85	6
2D Spatial Reasoning	CV-Bench (full)	Accuracy 78.2	5
3D Spatial Reasoning	CV-Bench 3D (full)	Accuracy 82	5
Hallucination Reasoning	CounterCurate	Accuracy 85.3	5
Hallucination Reasoning	SugarCrepe	Accuracy 86.4	5
Spatial Reasoning	BLINK (full)	Accuracy 56	5
High-level task planning and completion	EmbodiedBench (Alfred)	Base Score 24.7	4
