
Self-Hinting Language Models Enhance Reinforcement Learning

About

Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self-hint aligned GRPO with privileged supervision (SAGE), an on-policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt $x$, the model samples a compact hint $h$ (e.g., a plan or decomposition) and then generates a solution $\tau$ conditioned on $(x,h)$. Crucially, the task reward $R(x,\tau)$ is unchanged; hints only increase within-group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set $h=\varnothing$ and deploy the no-hint policy without any privileged information. Moreover, sampling diverse self-hints serves as an adaptive curriculum that tracks the learner's bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments on six benchmarks with three LLMs show that SAGE consistently outperforms GRPO: on average +2.0 points on Llama-3.2-3B-Instruct, +1.2 on Qwen2.5-7B-Instruct, and +1.3 on Qwen3-4B-Instruct. The code is available at https://github.com/BaohaoLiao/SAGE.
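The advantage-collapse mechanism described above can be sketched numerically. The snippet below is a minimal illustration (not the authors' implementation; function and variable names are hypothetical): GRPO standardizes rewards within a rollout group, so a group whose rollouts all receive the same sparse terminal reward yields zero advantages everywhere, while hint-induced outcome diversity restores a non-zero signal under the unchanged reward.

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within a rollout group.

    If every rollout in the group receives the same terminal reward,
    (r - mean) is zero for all of them and the policy update vanishes.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]


# A hard prompt under plain GRPO: all four rollouts fail the verifier,
# so the group reward is uniformly 0 and every advantage is 0.
adv_flat = grpo_advantages([0.0, 0.0, 0.0, 0.0])

# Same prompt with sampled self-hints: some hints let the model reach a
# verified solution, so rewards differ within the group and the
# standardized advantages are non-zero.
adv_hinted = grpo_advantages([0.0, 1.0, 0.0, 1.0])
```

Note that the verifier reward itself is never modified; only the sampled conditioning (the hint) changes, which is what reshapes the rollout distribution.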

Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Math Reasoning | AMC23 | - | - | 51 |
| Mathematical Reasoning | In-Distribution Avg | Average Score | 42.3 | 29 |
| Math Reasoning | AIME 24 | Average@16 | 16 | 26 |
| Math Reasoning | AIME 25 | Average@16 | 12.5 | 26 |
| Math Reasoning | OlympiadBench | Pass@16 | 45.9 | 18 |
| Generalization Reasoning | MMLU-Pro | Average@16 Accuracy | 59.3 | 14 |
| Generalization Reasoning | GPQA Diamond | Average@16 Accuracy | 38 | 14 |
| Generalization Reasoning | Out-of-Distribution Aggregate | Average Accuracy | 48.6 | 14 |
| Math Reasoning | MATH 500 | Average Accuracy@16 | 80 | 14 |
| Math Reasoning | Minerva Math | Average Accuracy@16 | 39.3 | 14 |

Other info

GitHub: https://github.com/BaohaoLiao/SAGE
