RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models
About
Reinforcement learning (RL) shows promise for enhancing LLM agentic reasoning, yet sparse terminal rewards hinder fine-grained optimization. Process reward modeling offers an alternative but incurs high computational costs, reward hacking risks, and annotation bottlenecks. We introduce RewardFlow, a lightweight method for estimating state-level rewards in agentic reasoning. By constructing state graphs that capture the intrinsic topological structure of trajectories, RewardFlow performs topology-aware propagation to estimate each state's contribution to success, yielding principled, annotation-free dense rewards. Used for RL optimization, RewardFlow substantially outperforms prior baselines across four agentic benchmarks: +6.2% average success rate on text-based tasks, +29.7% on visual reasoning over the strongest baseline across three model scales, and +10% accuracy on DeepResearch, with superior robustness and training efficiency. The implementation of RewardFlow is publicly available at https://github.com/tmlr-group/RewardFlow.
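The abstract does not spell out the propagation rule, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes states are hashable and that identical states from different trajectories merge into one graph node. As a simple stand-in for topology-aware propagation, it uses a breadth-first backward pass with a discount factor. The names `build_state_graph`, `propagate_rewards`, and `gamma` are hypothetical and do not come from the RewardFlow repository.

```python
from collections import defaultdict, deque


def build_state_graph(trajectories):
    """Merge trajectories into one directed graph keyed by state.

    Each trajectory is a (states, success) pair: a list of hashable
    states and a bool for the terminal outcome. Returns the reversed
    adjacency (child -> set of parents) and the set of terminal states
    reached by successful trajectories.
    """
    parents = defaultdict(set)
    goals = set()
    for states, success in trajectories:
        for src, dst in zip(states, states[1:]):
            if src != dst:  # ignore self-loops
                parents[dst].add(src)
        if success and states:
            goals.add(states[-1])
    return parents, goals


def propagate_rewards(parents, goals, gamma=0.9):
    """Breadth-first backward propagation from successful terminal states.

    A state whose shortest path to a successful terminal state has d
    edges receives gamma ** d. States that cannot reach any successful
    terminal state are absent from the result (treat them as 0).
    """
    reward = {g: 1.0 for g in goals}
    queue = deque(goals)
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in reward:  # first visit is the shortest path
                reward[parent] = gamma * reward[node]
                queue.append(parent)
    return reward


# Toy data: one successful and one failed trajectory sharing states s0 and s1.
trajectories = [
    (["s0", "s1", "s2", "goal"], True),
    (["s0", "s3", "s1", "s4"], False),
]
parents, goals = build_state_graph(trajectories)
rewards = propagate_rewards(parents, goals)
```

In this toy example, `s3` appears only in the failed trajectory yet still receives a nonzero reward, because the merged graph shows it leads into `s1`, which lies on a successful path. The dead-end state `s4` receives none. This is how a shared graph can turn a single terminal signal into dense state-level rewards without annotation.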
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Online Shopping | Webshop | Score | 84.4 | 61 |
| Interactive Task Completion | AlfWorld | Pick Success Rate | 100 | 45 |
| Visual Agentic Reasoning | Sokoban | Success Rate | 62.4 | 27 |
| Question Answering | DeepResearch | HotpotQA Score | 44.7 | 12 |