Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization

About

Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance on complex reasoning tasks. However, these methods often exhibit an imbalanced exploration-exploitation trade-off, resulting in unstable optimization and suboptimal performance. We introduce IB-Score, a novel metric grounded in Information Bottleneck (IB) theory that evaluates a policy's exploration-exploitation balance by quantifying the trade-off between step-level reasoning diversity and the mutual information shared with the correct answer. Analysis based on IB-Score shows that popular online RL approaches (e.g., GRPO) with common regularizers fail to maintain this balance consistently during training, leading to suboptimal results. To address this, we propose Information Bottleneck-driven Tree-based Policy Optimization (IB-TPO), a principled framework that formulates IB-Score as a fine-grained optimization objective and uses a novel IB-guided tree sampling strategy. This strategy not only improves the efficiency of online sampling, yielding 50% more trajectories under the same token budget, but also reuses the tree structure for effective Monte Carlo estimation of IB-Score. Extensive experiments across standard benchmarks show that our method significantly outperforms the GRPO baseline by 2.9% to 3.6% and also outperforms other state-of-the-art online RL approaches. Our code is available at https://github.com/alibaba/EfficientRL.
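
For readers unfamiliar with the theory the abstract builds on, the classical Information Bottleneck objective is shown below for reference. This is the standard textbook Lagrangian, not the paper's IB-Score: the abstract does not give the IB-Score formula, and the symbols here (input X, compressed representation T, target Y, trade-off weight β) are the generic ones, so any mapping onto reasoning steps and correct answers is an assumption.

```latex
% Classical Information Bottleneck Lagrangian (generic form, not IB-Score itself).
% X: input, T: compressed representation of X, Y: prediction target,
% \beta > 0: weight trading compression against retained relevance.
\min_{p(t \mid x)} \; \mathcal{L}_{\mathrm{IB}}
  = I(X; T) - \beta \, I(T; Y)
```

The first term penalizes information kept about the input, and the second rewards information kept about the target, so β sets how the two are balanced.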

Hao Jiang, Shurui Li, Tianpeng Bu, Bowen Xu, Xin Liu, Qihua Chen, Hongtao Duan, Lulu Hu, Bin Yang, Minying Zhang • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Mathematical Reasoning	AIME 25	Accuracy	17.7	112
Instruction Following	IFEval	Accuracy (IFEval)	57.9	101
Code Generation	LiveCodeBench	Accuracy	19.5	84
Science Reasoning	GPQA	Accuracy (GPQA)	46.6	72
Mathematics	AIME 24	Avg@32	0.197	20
Mathematics	AIME 25	Avg@32	15.3	20
Question Answering	GPQA	Strict Accuracy	41.9	17
Comprehensive Evaluation	Overall Across Benchmarks	Avg@32 Accuracy	44.3	16
Instruction	IFEval	Avg@32 Accuracy	46.2	16
Mathematics	MATH 500	Accuracy (avg@32)	83.3	16

Showing 10 of 14 rows
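
Several rows above report an Avg@32 metric. The page does not define it, so the sketch below assumes the common convention: sample k responses per problem, take the fraction that are correct, and average that fraction over problems. The function name `avg_at_k` and the input layout are hypothetical and are not taken from the paper's code.

```python
from typing import Sequence


def avg_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Average accuracy over k samples per problem (assumed Avg@k convention).

    correct[i][j] is True when the j-th sampled response to problem i
    was judged correct. Every problem must have at least one sample.
    """
    if not correct:
        raise ValueError("need at least one problem")
    per_problem = []
    for samples in correct:
        if not samples:
            raise ValueError("each problem needs at least one sample")
        # Fraction of this problem's samples that were correct.
        per_problem.append(sum(1 for c in samples if c) / len(samples))
    # Unweighted mean of the per-problem fractions.
    return sum(per_problem) / len(per_problem)


# Two problems with k = 4 samples each: 3/4 and 1/4 correct.
example = [
    [True, False, True, True],
    [False, False, False, True],
]
score = avg_at_k(example)
```

With k = 32, each inner list would hold 32 outcomes. Whether a result is shown as a fraction or a percentage is a presentation choice, which may explain why the table contains both 0.197 and values such as 15.3.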
