TreeRPO: Tree Relative Policy Optimization
About
Large Language Models (LLMs) have shown remarkable reasoning capabilities through Reinforcement Learning with Verifiable Rewards (RLVR) methods. However, a key limitation of existing approaches is that rewards defined at the full trajectory level provide insufficient guidance for optimizing the intermediate steps of a reasoning process. To address this, we introduce **TreeRPO**, a novel method that estimates the mathematical expectations of rewards at various reasoning steps using tree sampling. Unlike prior methods that rely on a separate step reward model, TreeRPO estimates these rewards directly through the sampling process. Building on the group-relative reward training mechanism of GRPO, TreeRPO computes rewards over the step-level groups generated during tree sampling. This allows TreeRPO to produce fine-grained, dense reward signals that significantly enhance the learning process and overall performance of LLMs. Experimental results show that TreeRPO substantially improves the average Pass@1 accuracy of Qwen-2.5-Math on test benchmarks, raising it from 19.0% to 35.5%. Furthermore, TreeRPO outperforms GRPO by 2.9% while reducing the average response length by 18.1%, demonstrating both its effectiveness and efficiency. Our code will be available at [https://github.com/yangzhch6/TreeRPO](https://github.com/yangzhch6/TreeRPO).
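To make the mechanism concrete, here is a minimal Python sketch of the two ideas the abstract describes: estimating a step's reward as the mean outcome of the sampled trajectories passing through it, and normalizing rewards within a step-level sibling group in the GRPO style. The `Node`, `expected_reward`, and `step_advantages` names and the exact normalization are illustrative assumptions, not the released implementation.

```python
# A minimal sketch of step-level group-relative advantages via tree sampling.
# Node structure, naming, and the sibling-group normalization are assumptions
# for illustration; see the repository for the authors' actual code.
from __future__ import annotations

from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class Node:
    """One reasoning step; leaves carry a verifiable 0/1 outcome reward."""
    children: list[Node] = field(default_factory=list)
    leaf_reward: float | None = None  # set only on complete trajectories


def expected_reward(node: Node) -> float:
    """Estimate a step's reward as the mean reward of leaves in its subtree."""
    if node.leaf_reward is not None:
        return node.leaf_reward
    return mean(expected_reward(c) for c in node.children)


def step_advantages(parent: Node, eps: float = 1e-6) -> list[float]:
    """GRPO-style normalization, applied to a step-level sibling group."""
    rewards = [expected_reward(c) for c in parent.children]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy tree: two candidate next steps; the first leads to more correct leaves.
root = Node(children=[
    Node(children=[Node(leaf_reward=1.0), Node(leaf_reward=1.0)]),
    Node(children=[Node(leaf_reward=0.0), Node(leaf_reward=1.0)]),
])
print(step_advantages(root))  # ~[1.0, -1.0]: the first step is reinforced
```

Because every internal node gets an advantage relative to its siblings, the training signal is dense over reasoning steps rather than a single scalar per full trajectory.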
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multi-hop Question Answering | 2WikiMQA | F1: 70.1 | 154 |
| Multi-hop Question Answering | MuSiQue | -- | 106 |
| Single-hop Question Answering | TriviaQA | -- | 62 |
| Single-hop Question Answering | PopQA | -- | 55 |
| Multi-hop Question Answering | HotpotQA | F1: 59.6 | 31 |
| Multi-hop Question Answering | Bamboogle | F1: 55.7 | 25 |
| Question Answering | Knowledge-Intensive Question Answering Benchmarks Aggregate | F1: 56.5 | 15 |