On Group Relative Policy Optimization Collapse in Agent Search: The Lazy Likelihood-Displacement
About
Tool-integrated reinforcement learning (TIRL) enables large language models (LLMs) to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation that makes it appealing for this setting, yet it consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, in which declining likelihood produces low-confidence responses, which inflate gradients and ultimately cause collapse. We empirically characterize this process across models on a Search-R1-style, search-integrated question answering task, revealing a consistent three-phase trajectory: early stagnation, steady decay, and accelerated collapse. To address this, we propose LLDS, a likelihood-preserving regularization that activates only when a response action's likelihood decreases and regularizes only the tokens responsible. This fine-grained structure mitigates LLD with minimal interference. Our method stabilizes training, prevents gradient explosion, and yields substantial performance improvements across seven benchmarks, including relative improvements of +45.2% on Qwen2.5-3B and +37.1% on Qwen2.5-7B over vanilla GRPO training. Our results establish LLD as a previously overlooked bottleneck in GRPO-based TIRL and provide a practical path toward stable, scalable training of tool-integrated RL.
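The two-level gating described for LLDS (activate only when a response's likelihood decreases, then penalize only the tokens responsible) can be illustrated with a minimal sketch. The abstract does not give the exact loss, so everything here is an assumption made for illustration: the function name `lld_penalty`, the use of summed per-token log-probabilities as the response likelihood, and the hinge-style per-token term are all hypothetical and may differ from the authors' formulation.

```python
from typing import Sequence


def lld_penalty(old_logps: Sequence[float], new_logps: Sequence[float]) -> float:
    """Hypothetical likelihood-preserving penalty for a single response.

    old_logps / new_logps: per-token log-probabilities of the same response
    under the previous and the updated policy (an assumed interface).
    """
    if len(old_logps) != len(new_logps):
        raise ValueError("both policies must score the same tokens")

    # Response-level gate: do nothing unless the whole response became
    # less likely under the updated policy.
    if sum(new_logps) >= sum(old_logps):
        return 0.0

    # Token-level selection: only tokens whose log-probability dropped
    # contribute; tokens that improved are left untouched.
    return sum(max(0.0, old - new) for old, new in zip(old_logps, new_logps))


old = [-0.5, -1.0, -0.25]
improved = [-0.25, -1.0, -0.25]   # response likelihood went up
degraded = [-0.25, -2.0, -0.5]    # response likelihood went down

print(lld_penalty(old, improved))  # 0.0
print(lld_penalty(old, degraded))  # 1.25
```

In the degraded case the first token actually improved, so only the second and third tokens contribute to the penalty. In a training loop a term like this would presumably be added to the GRPO objective with a weighting coefficient, which is also an assumption here.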
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Question Answering | General QA (NQ, TriviaQA, PopQA) (test) | Overall Average Score | 45.1 | 49 |
| Multi-hop Question Answering | Multi-Hop QA (HotpotQA, 2Wiki, Musique, Bamboogle) (test) | HotpotQA Score | 0.466 | 44 |
| Multi-hop Question Answering | Multi-Hop QA (HotpotQA, 2Wiki, Musique, Bamboogle) | HotpotQA Score | 49.2 | 39 |
| General Question Answering | General QA (NQ, TriviaQA, PopQA) | NQ Accuracy | 51.8 | 34 |