SIGHT: Reinforcement Learning with Self-Evidence and Information-Gain Diverse Branching for Search Agent
About
Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to master autonomous search for complex question answering. However, this interaction introduces a critical challenge, particularly in multi-turn search scenarios: search results often suffer from high redundancy and low signal-to-noise ratios. Consequently, agents easily fall into "Tunnel Vision," where the forced interpretation of early noisy retrievals leads to irreversible error accumulation. To address these challenges, we propose SIGHT, a framework that enhances search-based reasoning through Self-Evidence Support (SES) and Information-Gain Driven Diverse Branching. SIGHT distills search results into high-fidelity evidence via SES and calculates an Information Gain score to pinpoint pivotal states where observations maximally reduce uncertainty. This score guides Dynamic Prompting Interventions (de-duplication, reflection, and adaptive branching) to spawn new branches with SES. Finally, by integrating SES and correctness rewards via Group Relative Policy Optimization (GRPO), SIGHT internalizes robust exploration strategies without external verifiers. Experiments on single-hop and multi-hop QA benchmarks demonstrate that SIGHT significantly outperforms existing approaches, particularly in complex reasoning scenarios, while using fewer search steps.
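To make two of the quantities named in the abstract concrete, here is a minimal Python sketch. It rests on assumptions, not on the paper's definitions: information gain is taken to be the drop in Shannon entropy of a hypothetical answer distribution before and after an observation, and the group-relative advantage uses the commonly cited GRPO form (each reward standardized against its rollout group). The function names, the `eps` parameter, and the example distributions are illustrative only; SIGHT's actual scoring and reward shaping may differ.

```python
import math
from statistics import mean, pstdev


def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(prior, posterior):
    """Assumed definition: uncertainty removed by an observation,
    measured as prior entropy minus posterior entropy."""
    return entropy(prior) - entropy(posterior)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each rollout's reward against its own group
    (mean-centered, divided by the group standard deviation)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: four equally likely candidate answers collapse
# to a single answer after a search result is read.
gain = information_gain([0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0])

# Hypothetical example: four rollouts for one question, two correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Under these assumptions, a state whose observation yields a large `gain` would be a candidate for branching, and rollouts with above-average reward in their group receive positive advantages.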
Related benchmarks
| Task | Dataset | Result (EM) | Rank |
|---|---|---|---|
| Single-hop Question Answering | TriviaQA | 59 | 62 |
| Single-hop Question Answering | PopQA | 43.2 | 55 |
| Multi-hop Question Answering | HotpotQA in-domain | 46.3 | 20 |
| Multi-hop Question Answering | Multi-Hop QA Average | 0.3775 | 20 |
| Question Answering | All QA Datasets Average | 40.66 | 20 |
| Single-hop Question Answering | NQ (Natural Questions) in-domain (test) | 31.45 | 20 |
| Multi-hop Reasoning | Musique, HotpotQA, 2Wiki, and Bamboogle 3-hop and above | 31.98 | 3 |