Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling

About

In agentic search, trajectory-level outcome rewards fail to quantify the contribution of individual steps, while existing step-level reward methods typically rely on costly tree sampling. We view world knowledge as a latent world graph and each information-seeking (IS) task as a search within a latent task graph, where effective steps should make progress through the graph toward the answer node. Based on this prior, we propose Graph-Distance Contribution Reward (GDCR), a step-level process reward that scores newly retrieved and newly cited entities by their distance to the answer node in a training-time Entity-Relation (ER) graph. We further propose Step Advantage Policy Optimization (SAPO), which converts GDCR into step-level advantages and combines them with trajectory-level outcome advantages. Experiments on four challenging benchmarks validate the effectiveness of our method.

Yuchen Liu, Yingjie Feng, Lixiong Qin, Jiasi Chen, Jianing Yu, Sheng Gao, Sheng Yang, Weiran Xu • 2026
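
The abstract describes GDCR and SAPO only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes an undirected toy ER graph stored as an adjacency dict, a reward of 1 / (1 + d) for the closest new entity in a step (d is the hop distance to the answer node), and an additive mix of mean-centred step rewards with a single trajectory-level advantage. The function names, the reward form, and the `weight` parameter are all hypothetical.

```python
from collections import deque


def bfs_distances(graph, source):
    """Hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def step_rewards(graph, answer, steps):
    """Score each step by the closest *new* entity it surfaces.

    Assumed reward form: 1 / (1 + d). Entities already seen in earlier
    steps, or not connected to the answer node, contribute nothing.
    """
    dist = bfs_distances(graph, answer)
    seen = set()
    rewards = []
    for entities in steps:
        new = [e for e in entities if e not in seen]
        seen.update(new)
        scores = [1.0 / (1 + dist[e]) for e in new if e in dist]
        rewards.append(max(scores, default=0.0))
    return rewards


def combined_advantages(rewards, outcome_advantage, weight=0.5):
    """Add mean-centred step rewards to the trajectory-level advantage.

    The additive mix and `weight` are assumptions for illustration.
    """
    mean = sum(rewards) / len(rewards)
    return [outcome_advantage + weight * (r - mean) for r in rewards]


# Toy ER graph: Q - A1 - A2 - ANS, plus an isolated distractor X.
graph = {
    "Q": ["A1"],
    "A1": ["Q", "A2"],
    "A2": ["A1", "ANS"],
    "ANS": ["A2"],
    "X": [],
}
# Entities retrieved or cited at each step of one trajectory.
steps = [["A1"], ["A1", "X"], ["A2"], ["ANS"]]

rewards = step_rewards(graph, "ANS", steps)
advantages = combined_advantages(rewards, outcome_advantage=1.0)
```

In this toy trajectory the second step repeats a known entity and adds only a disconnected one, so it receives zero step reward and the lowest combined advantage, while the step that reaches the answer node receives the highest.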

Related benchmarks

Task	Dataset	Result
Deep search	GAIA	Accuracy 70.9	59
Deep search	BrowseComp-ZH	Accuracy 45.7	35
Deep search	BrowseComp	Accuracy 42.8	24
Deep search	xBench-DS	Accuracy 75	16
