On Group Relative Policy Optimization Collapse in Agent Search: The Lazy Likelihood-Displacement

About

Tool-integrated (TI) reinforcement learning (RL) enables large language models (LLMs) to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation that makes it appealing for this setting, yet it consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, in which declining likelihood leads to low-confidence responses, which inflate gradients and ultimately cause collapse. We empirically characterize this process across models on a Search-R1-style, search-integrated question answering task, revealing a consistent three-phase trajectory: early stagnation, steady decay, and accelerated collapse. To address this, we propose a likelihood-preserving regularization, LLDS, that activates only when a response action's likelihood decreases and regularizes only the tokens responsible. This fine-grained structure mitigates LLD with minimal interference. Our method stabilizes training, prevents gradient explosion, and yields substantial performance improvements across seven benchmarks, including relative improvements of +45.2% on Qwen2.5-3B and +37.1% on Qwen2.5-7B over vanilla GRPO training. Our results establish LLD as a previously overlooked bottleneck in GRPO-based TI RL and provide a practical path toward stable, scalable training of tool-integrated RL.
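
The abstract describes LLDS only in words, so the following is a minimal sketch of the two properties it names: a response-level gate (the penalty is active only when the response's likelihood decreases) and token-level selectivity (only the tokens responsible are regularized). The function name `llds_penalty`, its arguments, and the exact penalty form (the sum of per-token log-probability drops) are assumptions made for illustration. They are not taken from the paper and may differ from the authors' actual loss.

```python
from typing import Sequence


def llds_penalty(old_logps: Sequence[float], new_logps: Sequence[float]) -> float:
    """Hypothetical likelihood-preserving penalty for a single response.

    old_logps / new_logps are per-token log-probabilities of the same
    response before and after a policy update (illustrative names).
    """
    if len(old_logps) != len(new_logps):
        raise ValueError("token sequences must have the same length")

    # Response-level gate: no penalty unless the whole response
    # became less likely under the updated policy.
    if sum(new_logps) >= sum(old_logps):
        return 0.0

    # Token-level selectivity: penalize only tokens whose own
    # log-probability dropped; tokens that improved contribute zero.
    return sum(max(0.0, old - new) for old, new in zip(old_logps, new_logps))


if __name__ == "__main__":
    old = [-0.5, -1.0, -0.2]
    new = [-0.4, -1.6, -0.3]  # overall likelihood decreased
    print(llds_penalty(old, new))
```

In this toy call the first token became more likely and is ignored, while the second and third tokens account for the entire penalty. In a training loop, a term of this kind would presumably be added to the GRPO objective with a weighting coefficient, which is also an assumption here.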

Wenlong Deng, Yushu Li, Boying Gong, Yi Ren, Christos Thrampoulidis, Xiaoxiao Li • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Single-hop Question Answering | PopQA | -- | 186 |
| Single-hop Question Answering | TriviaQA | -- | 133 |
| Multi-hop Question Answering | Multi-Hop QA (HotpotQA, 2Wiki, Musique, Bamboogle) (test) | HotpotQA Score: 0.466 | 60 |
| Multi-hop Question Answering | Multi-Hop QA (HotpotQA, 2Wiki, Musique, Bamboogle) | HotpotQA Score: 49.2 | 54 |
| Question Answering | General QA NQ, TriviaQA, PopQA (test) | Overall Average Score: 45.1 | 49 |
| General Question Answering | General QA NQ, TriviaQA, PopQA | NQ Accuracy: 51.8 | 40 |
| Multi-hop Question Answering | HotpotQA | F1 Score: 60.63 | 14 |
| Multi-hop Question Answering | 2WikiMultihopQA | F1 Score: 56.17 | 14 |
| Multi-hop Question Answering | MuSiQue | F1 Score: 31.57 | 14 |
| Multi-hop Question Answering | Bamboogle | F1 Score: 58.36 | 14 |