Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning
About
Large language reasoning models have demonstrated remarkable success on static tasks, yet their application to multi-round agentic planning in interactive environments faces two fundamental challenges. First, the intractable credit assignment problem renders conventional reinforcement learning ineffective in sparse-reward settings. Second, the computational overhead of verbose, step-by-step reasoning histories is prohibitive. To address these challenges, we propose BPO, a three-stage framework (bootstrapping, extrapolation, and refinement) that establishes a self-improving data flywheel to develop robust reasoning models for long-horizon, sparse-reward environments. Our framework first bootstraps efficient reasoning using the proposed planning quaternions with long-short chain-of-thought fusion. It then extrapolates to out-of-distribution tasks through complexity-stratified curriculum learning. Finally, the model iteratively refines itself by learning exclusively from experiences selected via reward-gated rejection sampling. Experiments on ALFWorld, ScienceWorld, and WebShop demonstrate that our approach achieves state-of-the-art performance with significant token efficiency, providing a new recipe for reasoning models in agentic planning.
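The refinement stage named in the abstract, reward-gated rejection sampling, can be illustrated with a minimal sketch. The abstract does not specify the gating rule, so everything here is an assumption made for illustration: the `Trajectory` type, the `rollout` callable, the `threshold` parameter, and the toy environment are hypothetical names, and the gate is taken to be a simple cutoff on the sparse terminal reward. It is not the authors' implementation.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    task_id: str
    actions: List[str]
    reward: float  # sparse terminal reward (assumed: 1.0 on success, 0.0 otherwise)


def reward_gated_filter(trajectories: List[Trajectory],
                        threshold: float = 1.0) -> List[Trajectory]:
    """Keep only trajectories whose terminal reward meets the gate (assumed rule)."""
    return [t for t in trajectories if t.reward >= threshold]


def collect_round(rollout: Callable[[str], Trajectory],
                  task_ids: List[str],
                  samples_per_task: int,
                  threshold: float = 1.0) -> List[Trajectory]:
    """Sample several rollouts per task and retain only the gated ones.

    The retained set would serve as training data for the next round.
    """
    kept: List[Trajectory] = []
    for task_id in task_ids:
        candidates = [rollout(task_id) for _ in range(samples_per_task)]
        kept.extend(reward_gated_filter(candidates, threshold))
    return kept


def make_toy_rollout(seed: int, success_prob: float = 0.5) -> Callable[[str], Trajectory]:
    """Hypothetical stand-in for a policy acting in an environment."""
    rng = random.Random(seed)

    def rollout(task_id: str) -> Trajectory:
        reward = 1.0 if rng.random() < success_prob else 0.0
        return Trajectory(task_id=task_id, actions=["look", "take"], reward=reward)

    return rollout


if __name__ == "__main__":
    data = collect_round(make_toy_rollout(seed=0), ["task-a", "task-b"], samples_per_task=4)
    print(f"kept {len(data)} of 8 sampled trajectories")
```

Because failed rollouts are discarded rather than penalized, a scheme of this kind avoids assigning credit to individual steps; only the terminal outcome decides whether a trajectory is reused.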
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Interactive Decision-making | ScienceWorld Seen | Success Rate | 83.16 | 72 |
| Interactive Decision-making | WebShop (test) | Success Rate | 97 | 37 |
| Interactive Decision-making | ScienceWorld Unseen | Success Rate | 85.15 | 32 |
| Interactive Decision-making | ALFWorld Seen | Success Rate | 87.9 | 32 |
| Interactive Decision-making | ALFWorld Unseen | Success Rate | 89.6 | 32 |
| Interactive Environment Task Completion | ALFWorld Unseen | Average Reward | 89.6 | 31 |
| Interactive Environment Task Completion | ALFWorld Seen | Average Reward | 87.9 | 31 |
| E-commerce Agent Interaction | Webshop | Average Reward | 67.45 | 12 |
| Interactive Agent Task | ScienceWorld Seen | Average Reward | 77.1 | 9 |
| Interactive Agent Task | ScienceWorld Unseen | Average Reward | 75.67 | 9 |