
TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs

About

Search-augmented large language models (LLMs) trained with reinforcement learning (RL) have achieved strong results on open-domain question answering (QA), but training remains a significant challenge. The optimization is often unstable due to sparse rewards and difficult credit assignment across reasoning and tool calls. To address this, we introduce Turn-Level Information Potential Reward Shaping (TIPS), a simple framework that assigns dense, turn-level rewards to each reasoning + tool-call segment based on the increased likelihood of the correct answer under a teacher model. By leveraging potential-based reward shaping, TIPS offers fine-grained and policy-invariant guidance that overcomes the limitations of outcome-only optimization. Evaluated on seven QA benchmarks, TIPS consistently outperforms GRPO/PPO baselines and substantially improves training stability. For instance, with a Qwen-2.5 7B Instruct model, TIPS improves the average Exact Match score by 11.8% and F1 by 13.6% relative to PPO. Our results demonstrate that turn-level information-potential reward shaping provides an effective and general solution to sparse-reward credit assignment for multi-turn LLM reasoning.
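The core idea described above can be sketched with a few lines of code: define a potential Φ for each partial trajectory (here, the teacher model's log-likelihood of the gold answer given the context so far), and reward each turn by the change in potential. This is a minimal illustration of potential-based shaping in the TIPS setting, not the authors' implementation; the function name and inputs are assumptions.

```python
def shaped_turn_rewards(potentials, gamma=1.0):
    """Potential-based reward shaping over turns (Ng et al., 1999):
        F(s_t, s_{t+1}) = gamma * Phi(s_{t+1}) - Phi(s_t).

    potentials: list of Phi values at each turn boundary, e.g. the
    teacher model's log-prob of the correct answer given the context
    before turn 1, after turn 1, ..., after turn T. (Illustrative
    interface, not the paper's actual API.)
    """
    return [gamma * potentials[t + 1] - potentials[t]
            for t in range(len(potentials) - 1)]

# Example: the teacher's log-prob of the gold answer rises as each
# reasoning + tool-call turn retrieves useful evidence, so each turn
# earns a dense positive reward proportional to the information gained.
rewards = shaped_turn_rewards([-5.0, -3.2, -1.1, -0.4])
```

Because the shaped reward telescopes over a trajectory, the total shaped return depends only on the endpoint potentials, which is what makes this form of shaping policy-invariant.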

Yutao Xie, Nathaniel Thomas, Nicklas Hansen, Yang Fu, Li Erran Li, Xiaolong Wang• 2026

Related benchmarks

Task               | Dataset     | Metric           | Result | Rank
Question Answering | 2Wiki       | F1               | 50.64  | 152
Question Answering | Bamboogle   | Exact Match (EM) | 36.8   | 120
Question Answering | MuSiQue     | F1               | 26.58  | 70
Deep Research      | BrowseComp+ | Accuracy         | 9.4    | 38
Question Answering | PopQA       | F1               | 49.26  | 30
Question Answering | NQ          | F1               | 53.22  | 24
Question Answering | MuSiQue     | Exact Match (EM) | 17.05  | 10
Question Answering | TriviaQA    | F1               | 72.17  | 10
Question Answering | NQ          | F1               | 40.38  | 9
Question Answering | HotpotQA    | F1               | 45.98  | 9

(Showing 10 of 12 rows)
