TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization
About
Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning rely predominantly on sparse outcome-level rewards, which leads to a "Double Homogenization Dilemma." The dilemma has two parts: (1) process homogenization, where the thinking, reasoning, and tool use involved in generation are ignored by the reward; and (2) intra-group homogenization, where coarse-grained outcome rewards make intra-group advantage estimation inefficient for methods such as Group Relative Policy Optimization (GRPO) during sampling. To address this, we propose Turn-level Stage-aware Policy Optimization (TSPO). TSPO introduces the First-Occurrence Latent Reward (FOLR) mechanism, which allocates a partial reward to the step where the ground-truth answer first appears, thereby preserving process-level signals and increasing reward variance within groups without requiring external reward models or any annotations. Extensive experiments demonstrate that TSPO significantly outperforms state-of-the-art baselines, achieving average performance gains of 24% and 13.6% on Qwen2.5-3B and 7B models, respectively.
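The FOLR idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the substring-matching rule, the partial reward value of 0.5, and the choice to sum turn rewards into one rollout score are all assumptions made for illustration. The sketch shows how a partial reward at the first turn containing the ground-truth answer can separate two rollouts that an outcome-only reward would score identically.

```python
from statistics import mean, pstdev


def normalize(text):
    """Lowercase and collapse whitespace (an assumed matching rule)."""
    return " ".join(text.lower().split())


def first_occurrence_turn(turns, answer):
    """Return the index of the first turn whose text contains the answer, else None."""
    needle = normalize(answer)
    for i, turn_text in enumerate(turns):
        if needle in normalize(turn_text):
            return i
    return None


def turn_rewards(turns, answer, outcome_correct, partial=0.5):
    """Per-turn rewards: a partial reward at the first-occurrence turn,
    plus an outcome reward on the final turn. `partial` is hypothetical."""
    if not turns:
        return []
    rewards = [0.0] * len(turns)
    hit = first_occurrence_turn(turns, answer)
    if hit is not None:
        rewards[hit] += partial
    if outcome_correct:
        rewards[-1] += 1.0
    return rewards


def group_advantages(scores, eps=1e-6):
    """GRPO-style group-relative advantage: (score - group mean) / group std."""
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


# Two rollouts for the same question; both give a wrong final answer,
# but the first one retrieved a passage containing the ground truth.
answer = "Paris"
rollout_a = ["search: capital of France", "doc: Paris is the capital.", "answer: Lyon"]
rollout_b = ["search: largest French city", "doc: no relevant result", "answer: Lyon"]

outcome_only = [0.0, 0.0]  # both wrong, so the group has no reward variance
with_folr = [
    sum(turn_rewards(rollout_a, answer, outcome_correct=False)),
    sum(turn_rewards(rollout_b, answer, outcome_correct=False)),
]

adv_outcome = group_advantages(outcome_only)
adv_folr = group_advantages(with_folr)
```

With outcome-only rewards the group has zero variance, so every advantage is zero and the group contributes no learning signal. With the partial reward, the rollout that surfaced the answer receives a positive advantage and the other a negative one.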
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multi-hop Question Answering | 2WikiMultihopQA | EM 40.7 | 278 |
| Multi-hop Question Answering | HotpotQA | -- | 221 |
| Question Answering | PopQA | -- | 186 |
| Multi-hop Question Answering | MuSiQue | EM 15.1 | 106 |
| Multi-hop Question Answering | Bamboogle | Exact Match 41.6 | 97 |
| Question Answering | NQ | EM 52.7 | 57 |
| General Question Answering | TriviaQA | Exact Match 68.7 | 39 |