TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs
About
Search-augmented large language models (LLMs) trained with reinforcement learning (RL) have achieved strong results on open-domain question answering (QA), but training remains challenging: optimization is often unstable due to sparse rewards and difficult credit assignment across reasoning and tool calls. To address this, we introduce Turn-Level Information-Potential Reward Shaping (TIPS), a simple framework that assigns dense, turn-level rewards to each reasoning + tool-call segment based on the increase in the likelihood of the correct answer under a teacher model. By leveraging potential-based reward shaping, TIPS offers fine-grained, policy-invariant guidance that overcomes the limitations of outcome-only optimization. Evaluated on seven QA benchmarks, TIPS consistently outperforms GRPO/PPO baselines and substantially improves training stability. For instance, with a Qwen-2.5 7B Instruct model, TIPS improves the average Exact Match score by 11.8% and F1 by 13.6% relative to PPO. Our results demonstrate that turn-level information-potential reward shaping provides an effective and general solution to sparse-reward credit assignment for multi-turn LLM reasoning.
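The core idea can be sketched in a few lines. In potential-based reward shaping, the shaped reward for a transition is `F = gamma * Phi(s') - Phi(s)`; here the potential `Phi` after each turn would be the teacher model's log-likelihood of the gold answer. The sketch below is a minimal illustration under that assumption; the function name, the exact placement of the outcome reward, and the list-of-potentials interface are ours, not from the paper.

```python
from typing import List

def shaped_turn_rewards(
    potentials: List[float],     # Phi_0 .. Phi_T: teacher log-likelihood of the
                                 # gold answer before turn 1 and after each turn
    outcome_reward: float,       # sparse terminal reward (e.g. EM correctness)
    gamma: float = 1.0,          # RL discount factor
) -> List[float]:
    """Dense per-turn rewards via potential-based shaping.

    Turn t receives F_t = gamma * Phi_t - Phi_{t-1}; the final turn
    additionally receives the sparse outcome reward. With gamma = 1 the
    shaping terms telescope to Phi_T - Phi_0, so the shaped return only
    shifts the original objective by a potential difference, which is
    what makes potential-based shaping policy-invariant.
    """
    if len(potentials) < 2:
        raise ValueError("need a potential before turn 1 and after each turn")
    rewards = [
        gamma * potentials[t + 1] - potentials[t]
        for t in range(len(potentials) - 1)
    ]
    rewards[-1] += outcome_reward
    return rewards
```

For example, if the teacher's log-likelihood of the gold answer rises from -5.0 to -3.0 after a search turn and to -1.0 after a synthesis turn, the two turns get shaped rewards 2.0 and 2.0, and the final turn also absorbs the terminal outcome reward.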
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Question Answering | 2Wiki | F1 | 50.64 | 152 |
| Question Answering | Bamboogle | EM | 36.8 | 120 |
| Question Answering | MuSiQue | F1 | 26.58 | 70 |
| Deep Research | BrowseComp+ | Accuracy | 9.4 | 38 |
| Question Answering | PopQA | F1 | 49.26 | 30 |
| Question Answering | NQ | F1 | 53.22 | 24 |
| Question Answering | MuSiQue | Exact Match (EM) | 17.05 | 10 |
| Question Answering | TriviaQA | F1 | 72.17 | 10 |
| Question Answering | NQ | F1 | 40.38 | 9 |
| Question Answering | HotpotQA | F1 | 45.98 | 9 |