TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization

About

Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning rely predominantly on sparse outcome-level rewards, leading to a "Double Homogenization Dilemma": (1) process homogenization, where the thinking, reasoning, and tool use involved in generation are ignored, and (2) intra-group homogenization, where coarse-grained outcome rewards degrade intra-group advantage estimation in methods such as Group Relative Policy Optimization (GRPO) during sampling. To address this, we propose Turn-level Stage-aware Policy Optimization (TSPO). TSPO introduces the First-Occurrence Latent Reward (FOLR) mechanism, which allocates a partial reward to the step where the ground-truth answer first appears, thereby preserving process-level signals and increasing reward variance within groups without requiring external reward models or any annotations. Extensive experiments demonstrate that TSPO significantly outperforms state-of-the-art baselines, achieving average performance gains of 24% and 13.6% on Qwen2.5-3B and 7B models, respectively.
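To make the FOLR idea concrete, below is a minimal sketch of turn-level reward allocation as the abstract describes it. This is not the authors' implementation: the function name `folr_rewards`, the `alpha` split between the first-occurrence turn and the final turn, and case-insensitive substring matching as the answer-occurrence test are all illustrative assumptions.

```python
# Minimal sketch of the First-Occurrence Latent Reward (FOLR) idea.
# All names and the alpha split are assumptions, not the paper's code.

def folr_rewards(turns: list[str], ground_truth: str,
                 outcome_reward: float, alpha: float = 0.5) -> list[float]:
    """Return a per-turn reward vector for one trajectory.

    A fraction `alpha` of the outcome reward is credited to the first
    turn whose text contains the ground-truth answer; the remainder
    stays on the final turn. If the answer never surfaces before the
    final turn, the full outcome reward is placed on the final turn.
    Trajectories that reach the answer earlier thus receive distinct
    per-turn signals, raising reward variance within a GRPO group.
    """
    rewards = [0.0] * len(turns)
    first_hit = next(
        (i for i, t in enumerate(turns) if ground_truth.lower() in t.lower()),
        None,
    )
    if first_hit is not None and first_hit < len(turns) - 1:
        rewards[first_hit] = alpha * outcome_reward     # process-level credit
        rewards[-1] = (1.0 - alpha) * outcome_reward    # outcome-level credit
    else:
        rewards[-1] = outcome_reward                    # sparse fallback
    return rewards


if __name__ == "__main__":
    turns = [
        "Search: capital of Australia",
        "Observation: Canberra is the capital of Australia.",
        "Answer: Canberra",
    ]
    print(folr_rewards(turns, "Canberra", outcome_reward=1.0))
    # -> [0.0, 0.5, 0.5]: the retrieval turn that first surfaced the
    # answer gets partial credit instead of a uniform zero.
```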

Shichao Ma, Zhiyuan Ma, Ming Yang, Xiaofan Li, Xing Wu, Jintao Du, Yu Cheng, Weiqiang Wang, Qiliang Liu, Zhengyang Zhou, Yang Wang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Multi-hop Question Answering | 2WikiMultihopQA | EM | 40.7 | 278 |
| Multi-hop Question Answering | HotpotQA | -- | -- | 221 |
| Question Answering | PopQA | -- | -- | 186 |
| Multi-hop Question Answering | MuSiQue | EM | 15.1 | 106 |
| Multi-hop Question Answering | Bamboogle | Exact Match | 41.6 | 97 |
| Question Answering | NQ | EM | 52.7 | 57 |
| General Question Answering | TriviaQA | Exact Match | 68.7 | 39 |
