RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

About

Reinforcement learning (RL) shows promise for enhancing LLM agentic reasoning, yet sparse terminal rewards hinder fine-grained optimization. Process reward modeling offers an alternative but incurs high computational costs, reward hacking risks, and annotation bottlenecks. We introduce RewardFlow, a lightweight method for estimating state-level rewards in agentic reasoning. By constructing state graphs that capture the intrinsic topological structure of trajectories, RewardFlow performs topology-aware propagation to estimate each state's contribution to success, yielding principled, annotation-free dense rewards. Used for RL optimization, RewardFlow substantially outperforms prior baselines across four agentic benchmarks: +6.2% average success rate on text-based tasks, +29.7% on visual reasoning over the strongest baseline across three model scales, and +10% accuracy on DeepResearch, with superior robustness and training efficiency. The implementation of RewardFlow is publicly available at https://github.com/tmlr-group/RewardFlow.
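The idea of propagating a terminal reward over a state graph can be illustrated with a minimal sketch. This is not the RewardFlow implementation: the function names, the (states, success) trajectory format, and the distance-based decay with `gamma` are all assumptions chosen for illustration. It merges identical states from several rollouts into one directed graph, then runs a breadth-first search backward from the terminal states of successful rollouts, so each state's score decays with its shortest distance to a success.

```python
from collections import defaultdict, deque


def build_state_graph(trajectories):
    """Merge identical states across rollouts into one directed graph.

    `trajectories` is a list of (states, success) pairs, where `states`
    is a list of hashable state identifiers (hypothetical format).
    Returns reversed adjacency (state -> predecessors) and goal states.
    """
    reverse_edges = defaultdict(set)
    goals = set()
    for states, success in trajectories:
        for src, dst in zip(states, states[1:]):
            reverse_edges[dst].add(src)
        if success and states:
            goals.add(states[-1])
    return reverse_edges, goals


def propagate_rewards(trajectories, gamma=0.9):
    """Assign each state gamma ** (shortest distance to a goal state).

    States from which no goal is reachable in the graph receive 0.0.
    The decay rule is an illustrative assumption, not the paper's rule.
    """
    reverse_edges, goals = build_state_graph(trajectories)
    values = {goal: 1.0 for goal in goals}
    queue = deque(goals)
    # Breadth-first search over reversed edges visits states in order of
    # increasing distance from the goals, so the first value set is final.
    while queue:
        state = queue.popleft()
        for pred in reverse_edges.get(state, ()):
            if pred not in values:
                values[pred] = gamma * values[state]
                queue.append(pred)
    all_states = {s for states, _ in trajectories for s in states}
    return {s: values.get(s, 0.0) for s in all_states}


example_rollouts = [
    (["s0", "s1", "s2", "goal"], True),   # successful rollout
    (["s0", "s3", "s4"], False),          # failed rollout
    (["s0", "s1", "s5"], False),          # failed, shares a prefix with the success
]
state_rewards = propagate_rewards(example_rollouts)
```

In this toy example, `s1` receives a positive score even inside the failed third rollout because the merged graph shows it lies on a path to success, while `s3`, `s4`, and `s5` receive 0.0. That is the sense in which graph structure, rather than a single rollout's outcome, determines the dense signal in this sketch.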

Xiao Feng, Bo Han, Zhanke Zhou, Jiaqi Fan, Jiangchao Yao, Ka Ho Li, Dahai Yu, Michael Kwok-Po Ng • 2026

Related benchmarks

Task                         Dataset       Result                   Rank
Online Shopping              Webshop       Score: 84.4              61
Interactive Task Completion  AlfWorld      Pick Success Rate: 100   45
Visual Agentic Reasoning     Sokoban       Success Rate: 62.4       27
Question Answering           DeepResearch  HotpotQA Score: 44.7     12