
Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling

About

In agentic search, trajectory-level outcome rewards fail to quantify the contribution of individual steps, while existing step-level reward methods typically rely on costly tree sampling. We view world knowledge as a latent world graph and each information-seeking (IS) task as a search within a latent task graph, where effective steps should make progress through the graph toward the answer node. Based on this prior, we propose Graph-Distance Contribution Reward (GDCR), a step-level process reward that scores newly retrieved and newly cited entities by their distance to the answer node in a training-time Entity-Relation (ER) graph. We further propose Step Advantage Policy Optimization (SAPO), which converts GDCR into step-level advantages and combines them with trajectory-level outcome advantages. Experiments on four challenging benchmarks validate the effectiveness of our method.
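
The idea of a graph-distance step reward can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the reward formula or the rule for combining advantages, so the `1 / (1 + d)` scoring, the mean-centering, the `beta` weight, the toy graph, and all function names below are assumptions chosen only for illustration.

```python
from collections import deque


def bfs_distances(graph, source):
    """Hop distance from `source` to every reachable node (adjacency-list graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def step_rewards(graph, answer, steps):
    """Score each step by the closest *new* entity it surfaces.

    Assumed form: 1 / (1 + d), where d is the hop distance to the answer node.
    Steps that surface no new reachable entity get 0.
    """
    dist = bfs_distances(graph, answer)
    seen = set()
    rewards = []
    for entities in steps:
        new = [e for e in entities if e not in seen]
        seen.update(new)
        reachable = [dist[e] for e in new if e in dist]
        rewards.append(1.0 / (1 + min(reachable)) if reachable else 0.0)
    return rewards


def combine_advantages(step_r, outcome_adv, beta=0.5):
    """Assumed mix: trajectory-level advantage plus a mean-centered step term."""
    if not step_r:
        return []
    mean = sum(step_r) / len(step_r)
    return [outcome_adv + beta * (r - mean) for r in step_r]


# Hypothetical ER graph, stored as undirected adjacency lists.
graph = {
    "q": ["a", "b"],
    "a": ["q", "c"],
    "b": ["q"],
    "c": ["a", "ans"],
    "ans": ["c"],
}
# Entities surfaced at each step of one trajectory.
steps = [["b"], ["a", "b"], ["c"], ["c"]]

rewards = step_rewards(graph, "ans", steps)
advantages = combine_advantages(rewards, outcome_adv=1.0)
```

In this toy trajectory the third step, which first reaches an entity adjacent to the answer node, receives the largest step-level advantage, and the fourth step, which repeats an already seen entity, receives the smallest, even though all four steps share the same trajectory-level outcome advantage.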

Yuchen Liu, Yingjie Feng, Lixiong Qin, Jiasi Chen, Jianing Yu, Sheng Gao, Sheng Yang, Weiran Xu • 2026

Related benchmarks

Task         Dataset        Result          Rank
Deep search  GAIA           Accuracy 70.9   59
Deep search  BrowseComp-ZH  Accuracy 45.7   35
Deep search  BrowseComp     Accuracy 42.8   24
Deep search  xBench-DS      Accuracy 75     16