Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning

About

Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from wasted experience and reward homogeneity, which hinder learning efficiency on difficult samples during large language model post-training. In this paper, we introduce Batch Adaptation Policy Optimization (BAPO), an off-policy RLVR framework that improves data efficiency in large language model post-training. BAPO dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones, while maintaining a lower-bound guarantee on policy improvement. Extensive experiments demonstrate that BAPO achieves an average 12.5% improvement over GRPO across mathematics, planning, and visual reasoning tasks. Crucially, BAPO solves 40.7% of the problems that base models consistently fail to solve.
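
The paper's exact selection rule and its lower-bound guarantee are not reproduced here, but the buffer-driven loop described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the Sample, ReplayBuffer, and select_batch names, the 0.5 reward threshold, the staleness cutoff, and the 50/50 reuse ratio are hypothetical and not taken from BAPO.

    # Minimal sketch of a buffer-driven batch selection loop (hypothetical
    # names and thresholds; not the paper's actual algorithm or guarantees).
    import random
    from dataclasses import dataclass, field

    @dataclass
    class Sample:
        prompt: str
        response: str
        reward: float        # verifiable reward in [0, 1]
        policy_version: int  # policy that generated the response

    @dataclass
    class ReplayBuffer:
        capacity: int = 10_000
        items: list = field(default_factory=list)

        def add(self, sample: Sample) -> None:
            self.items.append(sample)
            if len(self.items) > self.capacity:
                self.items.pop(0)  # evict the stalest experience first

        def hard(self) -> list:
            # Historically difficult samples: low reward so far, candidates
            # for re-evaluation under the current policy.
            return [s for s in self.items if s.reward < 0.5]

        def high_quality(self, version: int, max_staleness: int = 2) -> list:
            # High-reward samples fresh enough to reuse off-policy.
            return [s for s in self.items
                    if s.reward >= 0.5 and version - s.policy_version <= max_staleness]

    def select_batch(buffer: ReplayBuffer, fresh: list, version: int,
                     batch_size: int = 8, reuse_frac: float = 0.5) -> list:
        # Mix fresh on-policy rollouts with reused high-quality experience;
        # hard stored samples would be re-rolled upstream and arrive via `fresh`.
        reusable = buffer.high_quality(version)
        reused = random.sample(reusable, min(int(batch_size * reuse_frac), len(reusable)))
        batch = reused + fresh[: batch_size - len(reused)]
        for s in fresh:
            buffer.add(s)
        return batch

    # Usage: eight fresh rollouts from policy version 3, buffer initially empty.
    buf = ReplayBuffer()
    fresh = [Sample(f"q{i}", f"a{i}", random.random(), 3) for i in range(8)]
    print(len(select_batch(buf, fresh, version=3)))  # -> 8

The point mirrored here is that difficult prompts stay in circulation for re-evaluation while high-reward traces are recycled rather than discarded, instead of every rollout being used once and thrown away as in purely on-policy training.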

Xu Wan, Yansheng Wang, Wenqi Huang, Mingyang Sun • 2026

Related benchmarks

Task                     Dataset    Metric     Result   Rank
Mathematical Reasoning   MATH 500   Accuracy   89.18    155
Mathematical Reasoning   AMC        Accuracy   72.74    151
Mathematical Reasoning   AIME24     Accuracy   38.54    130
Mathematical Reasoning   Olympiad   Accuracy   50.06    50
