RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

About

Reinforcement learning (RL) shows promise for enhancing LLM agentic reasoning, yet sparse terminal rewards hinder fine-grained optimization. Process reward modeling offers an alternative but incurs high computational costs, reward hacking risks, and annotation bottlenecks. We introduce RewardFlow, a lightweight method for estimating state-level rewards in agentic reasoning. By constructing state graphs that capture the intrinsic topological structure of trajectories, RewardFlow performs topology-aware propagation to estimate each state's contribution to success, yielding principled, annotation-free dense rewards. Used for RL optimization, RewardFlow substantially outperforms prior baselines across four agentic benchmarks: +6.2% average success rate on text-based tasks, +29.7% on visual reasoning over the strongest baseline across three model scales, and +10% accuracy on DeepResearch, with superior robustness and training efficiency. The implementation of RewardFlow is publicly available at https://github.com/tmlr-group/RewardFlow.
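To make the idea of propagating rewards over a state graph concrete, here is a minimal sketch. It is not the authors' implementation; the official code is in the repository linked above. It rests on several assumptions that the abstract does not state: states are hashable, identical states seen in different trajectories are merged into one node, and the propagation rule is a breadth-first pass backward from successful terminal states that assigns each state `gamma ** distance`. The names `build_state_graph`, `propagate_rewards`, and `gamma` are hypothetical.

```python
from collections import defaultdict, deque


def build_state_graph(trajectories):
    """Merge trajectories into one graph (assumed construction, not the paper's).

    Each trajectory is (list_of_states, success_flag). Returns a map from each
    state to the set of states that directly preceded it, plus the set of
    terminal states of successful trajectories.
    """
    predecessors = defaultdict(set)
    goals = set()
    for states, success in trajectories:
        for src, dst in zip(states, states[1:]):
            if src != dst:  # ignore self-loops (actions that did nothing)
                predecessors[dst].add(src)
        if success and states:
            goals.add(states[-1])
    return predecessors, goals


def propagate_rewards(predecessors, goals, gamma=0.9):
    """Multi-source backward BFS: reward = gamma ** (hops to nearest goal).

    This decay rule is an illustrative assumption. States that cannot reach
    any goal are left out of the result (treat them as reward 0).
    """
    rewards = {g: 1.0 for g in goals}
    queue = deque((g, 0) for g in goals)
    while queue:
        state, dist = queue.popleft()
        for prev in predecessors.get(state, ()):
            if prev not in rewards:  # first visit is the shortest distance
                rewards[prev] = gamma ** (dist + 1)
                queue.append((prev, dist + 1))
    return rewards


# Two toy rollouts that share states s0 and s1; only the first one succeeds.
trajectories = [
    (["s0", "s1", "s2", "goal"], True),
    (["s0", "s3", "s1", "s4"], False),
]
predecessors, goals = build_state_graph(trajectories)
rewards = propagate_rewards(predecessors, goals, gamma=0.5)
```

In this toy example, `s3` appears only in the failed rollout but still receives a nonzero reward, because it leads to `s1`, which lies on a successful path. The dead-end state `s4` receives none. This is the kind of dense, annotation-free signal that a sparse terminal reward alone would not provide.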

Xiao Feng, Bo Han, Zhanke Zhou, Jiaqi Fan, Jiangchao Yao, Ka Ho Li, Dahai Yu, Michael Kwok-Po Ng • 2026

Related benchmarks

Task	Dataset	Result
Online Shopping	Webshop	Score: 84.4	115
Interactive Task Completion	AlfWorld	Pick Success Rate: 100	72
Visual Agentic Reasoning	Sokoban	Success Rate: 62.4	27
Question Answering	DeepResearch	HotpotQA Score: 44.7	12

