
Exploring Reasoning Reward Model for Agents

About

Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse, outcome-based rewards for training. Such feedback fails to differentiate intermediate reasoning quality, leading to suboptimal training results. In this paper, we introduce the Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that produces structured feedback for agentic trajectories, including (1) an explicit reasoning trace, (2) a focused critique that provides refinement guidance by highlighting reasoning flaws, and (3) an overall score that evaluates process performance. Leveraging these signals, we systematically investigate three integration strategies: Reagent-C (text-augmented refinement), Reagent-R (reward-augmented guidance), and Reagent-U (unified feedback integration). Extensive evaluations across 12 diverse benchmarks demonstrate that Reagent-U yields substantial performance gains, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of our reasoning reward model and training schemes. Code, models, and datasets are all released to facilitate future research.
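
To make the structured feedback concrete, below is a minimal Python sketch of how the three signals might be represented and consumed. The field names, mixing weight, and prompt format are illustrative assumptions, not the released Agent-RRM interface; see the released code for the actual implementation.

```python
from dataclasses import dataclass


# Hypothetical container for the structured feedback described in the abstract.
@dataclass
class RRMFeedback:
    reasoning_trace: str  # (1) explicit reasoning over the agent trajectory
    critique: str         # (2) focused critique highlighting reasoning flaws
    score: float          # (3) overall process score, assumed to lie in [0, 1]


def unified_reward(outcome_reward: float, fb: RRMFeedback, alpha: float = 0.5) -> float:
    """Blend the sparse outcome reward with the dense process score
    (in the spirit of Reagent-R / Reagent-U); the mixing weight is illustrative."""
    return (1.0 - alpha) * outcome_reward + alpha * fb.score


def refinement_prompt(task: str, trajectory: str, fb: RRMFeedback) -> str:
    """Fold the critique back into the agent's context for text-augmented
    refinement (in the spirit of Reagent-C)."""
    return (
        f"{task}\n\n"
        f"Previous attempt:\n{trajectory}\n\n"
        f"Critique:\n{fb.critique}\n\n"
        "Revise your reasoning and try again."
    )
```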

Kaixuan Fan, Kaituo Feng, Manyuan Zhang, Tianshuo Peng, Zhixun Li, Yilei Jiang, Shuang Chen, Peng Pei, Xunliang Cai, Xiangyu Yue • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| General AI Assistant | GAIA text | Average Score | 43.7 | 19 |
| Web Browsing and Navigation | WebWalkerQA | Average Accuracy | 46.2 | 18 |
| High-Level Reasoning | HLE | Average Score | 10.8 | 17 |
| Web Search | xbench | Average Score | 43 | 15 |
| General AI Assistant Reasoning | GAIA text | Pass@1 | 43.7 | 3 |
| Multi-modal Agent Reasoning | GAIA (full) | Pass@1 | 38.8 | 3 |

Other info

GitHub
