
Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning

About

Large Language Reasoning Models have demonstrated remarkable success on static tasks, yet their application to multi-round agentic planning in interactive environments faces two fundamental challenges. First, the intractable credit assignment problem renders conventional reinforcement learning ineffective in sparse-reward settings. Second, the computational overhead of verbose, step-by-step reasoning histories is prohibitive. To address these challenges, we propose BPO, a three-stage framework (bootstrapping, extrapolation, and refinement) that establishes a self-improving data flywheel to develop robust reasoning models for long-horizon, sparse-reward environments. Our framework first bootstraps efficient reasoning using the proposed planning quaternions with long-short chain-of-thought fusion. It then extrapolates to out-of-distribution tasks through complexity-stratified curriculum learning. Finally, the model iteratively refines itself by learning exclusively on experiences selected via reward-gated rejection sampling. Experiments on ALFWorld, ScienceWorld, and WebShop demonstrate that our approach achieves state-of-the-art performance with significant token efficiency, providing a new recipe for reasoning models in agentic planning.

Yutong Wang, Pengliang Ji, Kaixin Li, Baolong Bi, Tao Feng, Guillaume Sartoretti • 2025
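
The refinement stage mentioned in the abstract selects training experiences via reward-gated rejection sampling. The abstract gives no implementation details, so the following is only a minimal sketch of the general idea under stated assumptions: roll out the current policy several times per task and keep only trajectories whose sparse terminal reward clears a threshold. The names `Trajectory`, `reward_gated_rejection_sampling`, `rollout`, `num_samples`, and `reward_threshold` are hypothetical and do not come from the paper.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    """One episode in an interactive environment (hypothetical container)."""
    task_id: str
    actions: List[str]
    reward: float  # sparse terminal reward, e.g. 1.0 on success, 0.0 otherwise


def reward_gated_rejection_sampling(
    task_ids: List[str],
    rollout: Callable[[str], Trajectory],
    num_samples: int = 4,
    reward_threshold: float = 1.0,
) -> List[Trajectory]:
    """Sample `num_samples` rollouts per task and keep only those whose
    reward is at least `reward_threshold`. Everything else is rejected.

    Assumption: the gate is a simple threshold on the terminal reward;
    the paper may use a different criterion.
    """
    kept: List[Trajectory] = []
    for task_id in task_ids:
        for _ in range(num_samples):
            traj = rollout(task_id)
            if traj.reward >= reward_threshold:
                kept.append(traj)
    return kept


if __name__ == "__main__":
    rng = random.Random(0)

    def toy_rollout(task_id: str) -> Trajectory:
        # Stand-in for running the current model in an environment:
        # succeeds with probability 0.5 and returns a sparse 0/1 reward.
        success = rng.random() < 0.5
        return Trajectory(task_id, ["look", "take", "put"], 1.0 if success else 0.0)

    accepted = reward_gated_rejection_sampling(["task-a", "task-b"], toy_rollout)
    print(f"kept {len(accepted)} of {2 * 4} sampled trajectories")
```

In a flywheel of this kind, the accepted trajectories would become the next round's fine-tuning data, and the loop would repeat with the updated model supplying `rollout`.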

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Interactive Decision-making | ScienceWorld Seen | Success Rate | 83.16 | 72 |
| Interactive Decision-making | WebShop (test) | Success Rate | 97 | 37 |
| Interactive Decision-making | ScienceWorld Unseen | Success Rate | 85.15 | 32 |
| Interactive Decision-making | ALFWorld Seen | Success Rate | 87.9 | 32 |
| Interactive Decision-making | ALFWorld Unseen | Success Rate | 89.6 | 32 |
| Interactive Environment Task Completion | ALFWorld Unseen | Average Reward | 89.6 | 31 |
| Interactive Environment Task Completion | ALFWorld Seen | Average Reward | 87.9 | 31 |
| E-commerce Agent Interaction | Webshop | Average Reward | 67.45 | 12 |
| Interactive Agent Task | ScienceWorld Seen | Average Reward | 77.1 | 9 |
| Interactive Agent Task | ScienceWorld Unseen | Average Reward | 75.67 | 9 |
Showing 10 of 13 rows
