
On Group Relative Policy Optimization Collapse in Agent Search: The Lazy Likelihood-Displacement

About

Tool-integrated (TI) reinforcement learning (RL) enables large language models (LLMs) to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation that makes it appealing for this setting, yet it consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, where declining likelihood leads to low-confidence responses, inflating gradients and ultimately causing collapse. We empirically characterize this process across models on a Search-R1-style, search-integrated question answering task, revealing a consistent three-phase trajectory: early stagnation, steady decay, and accelerated collapse. To address this, we propose a likelihood-preserving regularization, LLDS, that activates only when a response action's likelihood decreases and regularizes only the tokens responsible. This fine-grained structure mitigates LLD with minimal interference. Our method stabilizes training, prevents gradient explosion, and yields substantial performance improvements across seven benchmarks, including relative improvements of +45.2% on Qwen2.5-3B and +37.1% on Qwen2.5-7B over vanilla GRPO training. Our results establish LLD as a previously overlooked bottleneck in GRPO-based TI RL and provide a practical path toward stable, scalable training of tool-integrated RL.
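The abstract's exact formulation of LLDS is not given here, but the described structure, a penalty that is inactive unless a response's likelihood drops, and that touches only the tokens whose likelihood dropped, can be sketched as follows. All function names and the penalty form are illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch of an LLDS-style likelihood-preserving regularizer.
# Assumption: we have per-token log-probabilities of a sampled response
# under the previous policy (old_logps) and the current policy (new_logps).

def llds_penalty(old_logps, new_logps, coeff=0.1):
    """Return a regularization penalty for one response.

    Inactive (returns 0.0) unless the response's total log-likelihood
    decreased; otherwise penalizes only the tokens whose per-token
    log-probability decreased, i.e. the tokens responsible for the drop.
    """
    if sum(new_logps) >= sum(old_logps):
        return 0.0  # likelihood preserved or increased: no interference
    # Sum the per-token decreases; unchanged or improved tokens contribute 0.
    return coeff * sum(o - n for o, n in zip(old_logps, new_logps) if n < o)
```

In a GRPO training loop, such a term would be added to the policy loss per response, so gradients only flow through the offending tokens, matching the "fine-grained, minimal interference" design the abstract describes.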

Wenlong Deng, Yushu Li, Boying Gong, Yi Ren, Christos Thrampoulidis, Xiaoxiao Li • 2025

Related benchmarks

Task | Dataset | Result | Rank
Question Answering | General QA (NQ, TriviaQA, PopQA), test | Overall Average Score: 45.1 | 49
Multi-hop Question Answering | Multi-Hop QA (HotpotQA, 2Wiki, Musique, Bamboogle), test | HotpotQA Score: 0.466 | 44
Multi-hop Question Answering | Multi-Hop QA (HotpotQA, 2Wiki, Musique, Bamboogle) | HotpotQA Score: 49.2 | 39
General Question Answering | General QA (NQ, TriviaQA, PopQA) | NQ Accuracy: 51.8 | 34
