
On Group Relative Policy Optimization Collapse in Agent Search: The Lazy Likelihood-Displacement

About

Tool-integrated (TI) reinforcement learning (RL) enables large language models (LLMs) to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation that makes it appealing for this setting, yet it consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, where declining likelihood leads to low-confidence responses, inflating gradients and ultimately causing collapse. We empirically characterize this process across models on a Search-R1-style, search-integrated question answering task, revealing a consistent three-phase trajectory: early stagnation, steady decay, and accelerated collapse. To address this, we propose a likelihood-preserving regularization, LLDS, that activates only when a response action's likelihood decreases and regularizes only the tokens responsible. This fine-grained structure mitigates LLD with minimal interference. Our method stabilizes training, prevents gradient explosion, and yields substantial performance improvements across seven benchmarks, including relative improvements of +45.2% on Qwen2.5-3B and +37.1% on Qwen2.5-7B over vanilla GRPO training. Our results establish LLD as a previously overlooked bottleneck in GRPO-based TI RL and provide a practical path toward stable, scalable training of tool-integrated RL.
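The abstract's exact formulation of LLDS is not given here, but the described structure, a penalty that is inactive unless a response's likelihood drops, and that touches only the tokens whose likelihood dropped, can be sketched as follows. All function names and the penalty form are illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch of an LLDS-style likelihood-preserving regularizer.
# Assumption: we have per-token log-probabilities of a sampled response
# under the previous policy (old_logps) and the current policy (new_logps).

def llds_penalty(old_logps, new_logps, coeff=0.1):
    """Return a regularization penalty for one response.

    Inactive (returns 0.0) unless the response's total log-likelihood
    decreased; otherwise penalizes only the tokens whose per-token
    log-probability decreased, i.e. the tokens responsible for the drop.
    """
    if sum(new_logps) >= sum(old_logps):
        return 0.0  # likelihood preserved or increased: no interference
    # Sum the per-token decreases; unchanged or improved tokens contribute 0.
    return coeff * sum(o - n for o, n in zip(old_logps, new_logps) if n < o)
```

In a GRPO training loop, such a term would be added to the policy loss per response, so gradients only flow through the offending tokens, matching the "fine-grained, minimal interference" design the abstract describes.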

Wenlong Deng, Yushu Li, Boying Gong, Yi Ren, Christos Thrampoulidis, Xiaoxiao Li • 2025

Related benchmarks

Task | Dataset | Result | Rank
Question Answering | General QA (NQ, TriviaQA, PopQA), test | Overall Average Score: 45.1 | 49
Multi-hop Question Answering | Multi-Hop QA (HotpotQA, 2Wiki, Musique, Bamboogle), test | HotpotQA Score: 0.466 | 44
Multi-hop Question Answering | Multi-Hop QA (HotpotQA, 2Wiki, Musique, Bamboogle) | HotpotQA Score: 49.2 | 39
General Question Answering | General QA (NQ, TriviaQA, PopQA) | NQ Accuracy: 51.8 | 34
