Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings
About
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for post-training reasoning models. However, group-based methods such as Group Relative Policy Optimization (GRPO) face a critical dilemma in sparse-reward settings: pure Reinforcement Learning (RL) suffers from advantage collapse and high-variance gradient estimation, while mixed-policy optimization introduces persistent distributional bias. To resolve this dilemma, we introduce Hindsight-Anchored Policy Optimization (HAPO). HAPO employs the Synthetic Success Injection (SSI) operator, a hindsight mechanism that selectively anchors optimization to teacher demonstrations during failure. This injection is governed by a Thompson sampling-inspired gating mechanism, creating an autonomous, self-paced curriculum. Theoretically, we demonstrate that HAPO achieves *asymptotic consistency*: by naturally annealing the teacher signal as the policy improves, HAPO recovers the unbiased on-policy gradient. This ensures off-policy guidance acts as a temporary scaffold rather than a persistent ceiling, enabling the model to surpass the limitations of static teacher forcing.
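To make the mechanism concrete, below is a minimal sketch (Python with NumPy) of how an SSI-style injection and a Thompson-sampling-inspired gate could fit together in a group-based update. The names (`SSIGate`, `group_advantages`), the Beta-posterior parameterization, and the binary success reward are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class SSIGate:
    """Thompson-sampling-inspired gate for Synthetic Success Injection.

    Keeps a Beta posterior over the policy's success probability and
    fires (injects a teacher demonstration) only when a posterior sample
    suggests the policy is still likely to fail, so the teacher signal
    anneals away as the policy improves.
    """

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha  # pseudo-count of observed successes
        self.beta = beta    # pseudo-count of observed failures

    def update(self, successes, failures):
        self.alpha += successes
        self.beta += failures

    def should_inject(self):
        # Sample a plausible success rate from the posterior; fire with
        # probability (1 - sampled success rate), so injection becomes
        # rarer as evidence of success accumulates.
        p = rng.beta(self.alpha, self.beta)
        return rng.random() > p


def group_advantages(rewards, gate, teacher_reward=1.0):
    """Group-relative advantages with optional hindsight injection."""
    rewards = list(rewards)
    successes = sum(r > 0 for r in rewards)  # sparse binary reward assumed
    gate.update(successes, len(rewards) - successes)

    injected = False
    if successes == 0 and gate.should_inject():
        # All rollouts failed: anchor the group to a teacher success so
        # the group baseline stays informative instead of collapsing.
        rewards.append(teacher_reward)
        injected = True

    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + 1e-8)  # GRPO-style normalization
    return adv, injected


# Usage: an all-failure group without injection yields zero advantages
# (advantage collapse); with injection, the group carries gradient signal.
gate = SSIGate()
adv, injected = group_advantages([0.0, 0.0, 0.0, 0.0], gate)
print(adv, injected)
```

Under this sketch, the gate fires with expected probability equal to the posterior mean failure rate, β/(α+β), so successful rollouts steadily shrink the injection rate toward zero. That annealing is the intuition behind the asymptotic-consistency claim: once the teacher term vanishes, the update reduces to the unbiased on-policy group-relative gradient.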
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | MATH500 (test) | -- | 514 |
| Mathematical Reasoning | AIME 2024 (test) | -- | 159 |
| Mathematical Reasoning | OlympiadBench (test) | @1 Success Rate: 51.4 | 15 |