Tree Search for LLM Agent Reinforcement Learning

About

Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-horizon, multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from sparse supervision. To address this challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents a complete agent interaction step. By sharing common prefixes, tree-search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervision signals even when using only the outcome reward. Based on this, Tree-GRPO estimates grouped relative advantages at both the intra-tree and inter-tree levels. Through theoretical analysis, we demonstrate that the objective of intra-tree group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over the chain-based RL method.
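
To make the idea of intra-tree and inter-tree group relative advantages concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not give the formulas, so the sketch assumes the standard GRPO normalization (reward minus group mean, divided by group standard deviation) and assumes the two levels are simply summed. The function names `group_relative` and `tree_advantages` are hypothetical, and the sketch works only on leaf outcome rewards rather than on individual tree nodes.

```python
from statistics import mean, pstdev


def group_relative(rewards, eps=1e-6):
    """GRPO-style normalization within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def tree_advantages(trees):
    """Combine intra-tree and inter-tree advantages per leaf.

    trees: list of lists; each inner list holds the outcome rewards of the
    leaves (completed rollouts) of one tree.
    ASSUMPTION: the two levels are added together; the source abstract does
    not specify how they are combined.
    """
    # Inter-tree group: every leaf from every tree for the same prompt.
    all_rewards = [r for tree in trees for r in tree]
    inter = group_relative(all_rewards)

    advantages = []
    offset = 0
    for tree in trees:
        # Intra-tree group: only the leaves that share this tree's prefixes.
        intra = group_relative(tree)
        advantages.append(
            [a + inter[offset + i] for i, a in enumerate(intra)]
        )
        offset += len(tree)
    return advantages


if __name__ == "__main__":
    # Two trees with two leaves each; only one rollout earned the reward.
    print(tree_advantages([[1.0, 0.0], [0.0, 0.0]]))
```

In this toy case the first tree contains both a successful and a failed rollout, so its leaves receive opposite-signed intra-tree signals, while the second tree has no internal contrast and gets its signal only from the inter-tree comparison.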

Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multi-hop Question Answering | HotpotQA (test) | F1 18.6 | 198 |
| Multi-hop Question Answering | 2WikiMQA | F1 Score 68.7 | 154 |
| Multi-hop Question Answering | 2WikiMultiHopQA (test) | EM 26.8 | 143 |
| Multi-hop Question Answering | MuSiQue (test) | F1 7.2 | 111 |
| Multi-hop Question Answering | MuSiQue | -- | 106 |
| Single-hop Question Answering | TriviaQA | EM 57.81 | 62 |
| Single-hop Question Answering | PopQA | EM 44.14 | 55 |
| Question Answering | General QA NQ, TriviaQA, PopQA (test) | Overall Average Score 42.4 | 49 |
| Multi-hop Question Answering | Bamboogle (test) | EM 13.6 | 46 |
| Multi-hop Question Answering | Multi-Hop QA (HotpotQA, 2Wiki, Musique, Bamboogle) (test) | HotpotQA Score 0.424 | 44 |
Showing 10 of 21 rows
