
Training-Free Group Relative Policy Optimization

About

Recent advances in Large Language Model (LLM) agents have demonstrated promising general capabilities. However, their performance in specialized real-world domains often degrades because external tools and specific prompting strategies are difficult to integrate effectively. Methods such as agentic reinforcement learning have been proposed to address this, but they typically rely on costly parameter updates, for example through Supervised Fine-Tuning (SFT) followed by a Reinforcement Learning (RL) phase with Group Relative Policy Optimization (GRPO) to alter the output distribution. We argue instead that LLMs can achieve a similar effect on the output distribution by learning experiential knowledge as a token prior, a far more lightweight approach that not only copes with practical data scarcity but also avoids the common issue of overfitting. To this end, we propose Training-Free Group Relative Policy Optimization (Training-Free GRPO), a cost-effective solution that enhances LLM agent performance without any parameter updates. Our method leverages a group-relative semantic advantage, rather than a numerical one, within each group of rollouts, iteratively distilling high-quality experiential knowledge during multi-epoch learning on minimal ground-truth data. This knowledge serves as the learned token prior and is seamlessly integrated into LLM API calls to guide model behavior. Experiments on mathematical reasoning and web searching tasks demonstrate that Training-Free GRPO, applied to DeepSeek-V3.1-Terminus, significantly improves out-of-domain performance. With just a few dozen training samples, it outperforms fine-tuned small LLMs at marginal training data and cost.
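The learning loop described in the abstract can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the paper's implementation: `rollout_fn`, `reward_fn`, and `distill_fn` are hypothetical stand-ins for LLM API calls (sampling a rollout conditioned on the experience library, scoring it against ground truth, and distilling a natural-language lesson from a best/worst contrast within the group).

```python
def training_free_grpo(problems, rollout_fn, reward_fn, distill_fn,
                       group_size=4, epochs=2):
    """Illustrative sketch of Training-Free GRPO: instead of updating
    model weights, iteratively distill natural-language 'experiences'
    from groups of rollouts and keep them as a learned token prior
    that is prepended to later prompts."""
    experiences = []  # experiential knowledge library (the token prior)
    for _ in range(epochs):
        for problem, answer in problems:
            # Sample a group of rollouts conditioned on current experiences
            group = [rollout_fn(problem, experiences) for _ in range(group_size)]
            rewards = [reward_fn(out, answer) for out in group]
            # Group-relative signal: distill only when the group disagrees,
            # contrasting the best rollout with the worst one
            if max(rewards) > min(rewards):
                best = group[rewards.index(max(rewards))]
                worst = group[rewards.index(min(rewards))]
                lesson = distill_fn(problem, best, worst)
                if lesson and lesson not in experiences:
                    experiences.append(lesson)
    return experiences
```

In an actual deployment, the returned `experiences` list would be injected into the system prompt of subsequent API calls, shifting the output distribution without any parameter updates.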

Yuzheng Cai, Siqi Cai, Yuchen Shi, Zihan Xu, Lichao Chen, Yulei Qin, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Yong Mao, Ke Li, Xing Sun • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Reasoning | GSM8K | – | – | 106 |
| Reasoning | MATH 500 | Accuracy (%) | 53 | 90 |
| Mathematical Reasoning | Minerva | Accuracy | 21.69 | 62 |
| Reasoning | AIME 24 | Accuracy | 80 | 49 |
| Web navigation | WebArena | Overall Success Rate | 32.7 | 48 |
| Retrieval-Augmented Question Answering | DeepSearch 2wiki | Success Rate (SR) | 68 | 23 |
| Retrieval-Augmented Question Answering | DeepSearch TriviaQA | Success Rate (SR) | 76 | 23 |
| Retrieval-Augmented Question Answering | DeepSearch HotpotQA | Success Rate (SR) | 46 | 23 |
| Retrieval-Augmented Question Answering | DeepSearch Average | Success Rate (SR) | 49 | 23 |
| Retrieval-Augmented Question Answering | DeepSearch PopQA | Success Rate (SR) | 44 | 23 |

Showing 10 of 30 rows.

Other info

GitHub
