Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers

About

Long-horizon LLM agents depend on intermediate information-gathering turns, yet training feedback is usually observed only at the final answer, because process-level rewards require high-quality human annotation. Existing turn-level shaping methods reward turns that increase the likelihood of a gold answer, but they require answer supervision or stable task-specific verifiers. Conversely, label-free RL methods extract self-signals from output distributions, but mainly at the answer or trajectory level and therefore cannot assign credit to intermediate turns. We propose Self-Induced Outcome Potential (SIOP), which treats semantic clusters of final answers as latent future outcome states for potential-based turn-level credit assignment. For each query, SIOP samples multiple rollouts, clusters final answers into semantic outcome modes, and builds a reliability-aware target distribution over these states. It then rewards turns for increasing posterior support for reliable future states using a tractable cluster-level approximation. The objective generalizes information-potential shaping from gold-answer supervision to settings without task-specific gold verifiers while avoiding the broadcasted rollout-level advantages used by standard GRPO. We formalize the framework, characterize its supervised gold-answer limit, and show that SIOP improves average performance over verifier-free outcome-level baselines on seven search-augmented agentic reasoning benchmarks while approaching a gold-supervised outcome baseline. Code is available at https://github.com/dl-m9/SIOP.git.

Senkang Hu, Yong Dai, Xudong Han, Zhengru Fang, Yuzhi Zhao, Sam Tak Wu Kwong, Yuguang Fang • 2026
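
The pipeline summarized in the abstract (sample rollouts, cluster final answers, build a target distribution over outcome clusters, reward turns that raise posterior support for it) can be illustrated with a minimal sketch. This is not the authors' implementation. Everything in it is an assumption made for illustration: clustering is done by normalized string match as a stand-in for semantic clustering, the target distribution is a temperature-sharpened cluster frequency as a stand-in for the reliability-aware target, the potential is the target-weighted log posterior, and the per-turn posteriors over clusters are taken as given inputs rather than estimated from a model. All function names are hypothetical.

```python
import math
from collections import Counter


def cluster_answers(answers):
    """Map each final answer to a cluster id.

    Assumption: normalized exact match stands in for semantic clustering.
    """
    return [" ".join(a.lower().split()) for a in answers]


def target_distribution(cluster_ids, temperature=1.0):
    """Build a target distribution over outcome clusters.

    Assumption: cluster frequency, sharpened by a temperature, stands in
    for the reliability-aware target described in the abstract.
    """
    counts = Counter(cluster_ids)
    weights = {k: c ** (1.0 / temperature) for k, c in counts.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def potential(target, posterior, eps=1e-8):
    """Target-weighted log posterior support (an assumed potential)."""
    return sum(q * math.log(posterior.get(k, 0.0) + eps)
               for k, q in target.items())


def turn_rewards(target, posteriors):
    """Potential-based shaping: reward for each turn is the change in potential.

    `posteriors[t]` is a hypothetical estimate of the cluster posterior
    before turn t is taken; the final entry is the posterior after the last turn.
    """
    phis = [potential(target, p) for p in posteriors]
    return [phis[t + 1] - phis[t] for t in range(len(phis) - 1)]


if __name__ == "__main__":
    answers = ["Paris", "paris ", "Lyon", "Paris"]
    target = target_distribution(cluster_answers(answers))
    # One turn that shifts posterior mass toward the majority cluster.
    posteriors = [{"paris": 0.5, "lyon": 0.5}, {"paris": 0.8, "lyon": 0.2}]
    print(target)
    print(turn_rewards(target, posteriors))
```

Under these assumptions, a one-hot target on a single cluster reduces the turn reward to the change in log-likelihood of that cluster, which is the shape of the gold-answer limit the abstract mentions. Because each reward is a difference of potentials, the rewards along a rollout sum to the final potential minus the initial one.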

Related benchmarks

Task	Dataset	Result
Multi-hop Question Answering	2WikiMultihopQA	EM 41.2	559
Question Answering	PopQA	EM 44.5	27
Search-augmented multi-turn question answering	TriviaQA	Exact Match (EM) Accuracy 64.6	10
Search-augmented multi-turn question answering	Natural Questions (NQ)	Exact Match (EM) Accuracy 28.1	10
Search-augmented multi-turn question answering	Bamboogle	Exact Match (EM) 48.8	10
Search-augmented multi-turn question answering	HotpotQA	Exact Match (EM) 34.8	10
Search-augmented multi-turn question answering	MuSiQue	Exact Match (EM) 10.7	10
