Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning

About

Large language reasoning models have demonstrated remarkable success on static tasks, yet their application to multi-round agentic planning in interactive environments faces two fundamental challenges. First, the intractable credit assignment problem renders conventional reinforcement learning ineffective in sparse-reward settings. Second, the computational overhead of verbose, step-by-step reasoning histories is prohibitive. To address these challenges, we propose BPO, a three-stage framework (bootstrapping, extrapolation, and refinement) that establishes a self-improving data flywheel to develop robust reasoning models for long-horizon, sparse-reward environments. Our framework first bootstraps efficient reasoning using the proposed planning quaternions with long-short chain-of-thought fusion. It then extrapolates to out-of-distribution tasks through complexity-stratified curriculum learning. Finally, the model iteratively refines itself by learning exclusively on experiences selected via reward-gated rejection sampling. Experiments on ALFWorld, ScienceWorld, and WebShop demonstrate that our approach achieves state-of-the-art performance with significant token efficiency, providing a new recipe for reasoning models in agentic planning.
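
The refinement stage described above can be illustrated with a minimal sketch of reward-gated rejection sampling. This is not the authors' implementation: the `Trajectory` type, the `reward_gated_filter` and `refine_round` helpers, the `threshold` gate, and the `samples_per_task` count are all hypothetical names and values chosen for illustration. The sketch only assumes that each rollout ends with a single sparse terminal reward and that rollouts below the gate are discarded before training.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    """One rollout in an interactive environment (hypothetical structure)."""
    task_id: str
    steps: List[str]
    reward: float  # sparse terminal reward, observed only at episode end


def reward_gated_filter(trajectories: List[Trajectory],
                        threshold: float = 1.0) -> List[Trajectory]:
    """Keep only trajectories whose terminal reward reaches the gate."""
    return [t for t in trajectories if t.reward >= threshold]


def refine_round(sample_fn: Callable[[str], Trajectory],
                 task_ids: List[str],
                 samples_per_task: int = 4,
                 threshold: float = 1.0) -> List[Trajectory]:
    """Sample several rollouts per task and collect those that pass the gate.

    The returned list would serve as the training set for the next
    iteration of the flywheel (the training step itself is omitted here).
    """
    kept: List[Trajectory] = []
    for task_id in task_ids:
        candidates = [sample_fn(task_id) for _ in range(samples_per_task)]
        kept.extend(reward_gated_filter(candidates, threshold))
    return kept
```

In this sketch, `sample_fn` stands in for whatever policy produces a rollout; because only the terminal reward is inspected, the filter sidesteps per-step credit assignment, which is the motivation the abstract gives for selecting experiences rather than optimizing the policy directly.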

Yutong Wang, Pengliang Ji, Kaixin Li, Baolong Bi, Tao Feng, Guillaume Sartoretti • 2025

Related benchmarks

Task	Dataset	Result
Interactive Decision-making	ScienceWorld Seen	Success Rate: 83.16	72
Interactive Decision-making	WebShop (test)	Success Rate: 97	37
Interactive Decision-making	ScienceWorld Unseen	Success Rate: 85.15	32
Interactive Decision-making	ALFWorld Seen	Success Rate: 87.9	32
Interactive Decision-making	ALFWorld Unseen	Success Rate: 89.6	32
Interactive Environment Task Completion	ALFWorld Unseen	Average Reward: 89.6	31
Interactive Environment Task Completion	ALFWorld Seen	Average Reward: 87.9	31
E-commerce Agent Interaction	Webshop	Average Reward: 67.45	12
Interactive Agent Task	ScienceWorld Seen	Average Reward: 77.1	9
Interactive Agent Task	ScienceWorld Unseen	Average Reward: 75.67	9

Showing 10 of 13 rows
