Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers

About

Long-horizon LLM agents depend on intermediate information-gathering turns, yet training feedback is usually observed only at the final answer, because process-level rewards require high-quality human annotation. Existing turn-level shaping methods reward turns that increase the likelihood of a gold answer, but they require answer supervision or stable task-specific verifiers. Conversely, label-free RL methods extract self-signals from output distributions, but mainly at the answer or trajectory level and therefore cannot assign credit to intermediate turns. We propose Self-Induced Outcome Potential (SIOP), which treats semantic clusters of final answers as latent future outcome states for potential-based turn-level credit assignment. For each query, SIOP samples multiple rollouts, clusters final answers into semantic outcome modes, and builds a reliability-aware target distribution over these states. It then rewards turns for increasing posterior support for reliable future states using a tractable cluster-level approximation. The objective generalizes information-potential shaping from gold-answer supervision to settings without task-specific gold verifiers while avoiding the broadcasted rollout-level advantages used by standard GRPO. We formalize the framework, characterize its supervised gold-answer limit, and show that SIOP improves average performance over verifier-free outcome-level baselines on seven search-augmented agentic reasoning benchmarks while approaching a gold-supervised outcome baseline. Code is available at https://github.com/dl-m9/SIOP.git.
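The pipeline described above (sample rollouts, cluster final answers, build a target distribution over outcome clusters, reward turns that increase support for reliable clusters) can be illustrated with a small sketch. This is not the authors' implementation. Everything below is an assumption made for illustration: exact-match string normalization stands in for semantic clustering, a temperature-sharpened cluster frequency stands in for the reliability-aware target, the potential is taken to be the negative cross-entropy between target and per-turn posterior, and the per-turn posteriors are supplied by hand rather than estimated from a policy. All function names are hypothetical.

```python
import math
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (assumed stand-in for semantic clustering)."""
    return " ".join(answer.lower().split())


def cluster_answers(answers):
    """Map each final answer to a cluster key and return the sorted cluster labels."""
    keys = [normalize(a) for a in answers]
    labels = sorted(set(keys))
    return keys, labels


def target_distribution(keys, labels, temperature=0.5):
    """Assumed reliability weighting: sharpen cluster frequencies with a temperature."""
    counts = Counter(keys)
    n = len(keys)
    weights = [(counts[label] / n) ** (1.0 / temperature) for label in labels]
    z = sum(weights)
    return [w / z for w in weights]


def potential(posterior, target, eps=1e-8):
    """Assumed potential: negative cross-entropy between target and posterior."""
    return sum(q * math.log(max(p, eps)) for q, p in zip(target, posterior))


def turn_rewards(posteriors, target, gamma=1.0):
    """Potential-based shaping: r_t = gamma * Phi(s_{t+1}) - Phi(s_t)."""
    phis = [potential(p, target) for p in posteriors]
    return [gamma * phis[t + 1] - phis[t] for t in range(len(phis) - 1)]


# Four sampled rollouts for one query; three agree after normalization.
answers = ["Paris", "paris", " Paris ", "Lyon"]
keys, labels = cluster_answers(answers)          # labels == ["lyon", "paris"]
target = target_distribution(keys, labels)

# Hypothetical posteriors over [lyon, paris] before turn 1, after turn 1, after turn 2.
posteriors = [[0.5, 0.5], [0.3, 0.7], [0.1, 0.9]]
rewards = turn_rewards(posteriors, target)
print(labels, target, rewards)
```

With gamma set to 1, the turn rewards telescope to the difference between the final and initial potential, so each turn is credited only for the change in support it produces rather than sharing one rollout-level advantage.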

Senkang Hu, Yong Dai, Xudong Han, Zhengru Fang, Yuzhi Zhao, Sam Tak Wu Kwong, Yuguang Fang • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Multi-hop Question Answering | 2WikiMultihopQA | EM | 41.2 | 559
Question Answering | PopQA | EM | 44.5 | 27
Search-augmented multi-turn question answering | TriviaQA | Exact Match (EM) Accuracy | 64.6 | 10
Search-augmented multi-turn question answering | Natural Questions (NQ) | Exact Match (EM) Accuracy | 28.1 | 10
Search-augmented multi-turn question answering | Bamboogle | Exact Match (EM) | 48.8 | 10
Search-augmented multi-turn question answering | HotpotQA | Exact Match (EM) | 34.8 | 10
Search-augmented multi-turn question answering | MuSiQue | Exact Match (EM) | 10.7 | 10