
Self-Hinting Language Models Enhance Reinforcement Learning

About

Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self-hint aligned GRPO with privileged supervision (SAGE), an on-policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt $x$, the model samples a compact hint $h$ (e.g., a plan or decomposition) and then generates a solution $\tau$ conditioned on $(x,h)$. Crucially, the task reward $R(x,\tau)$ is unchanged; hints only increase within-group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set $h=\varnothing$ and deploy the no-hint policy without any privileged information. Moreover, sampling diverse self-hints serves as an adaptive curriculum that tracks the learner's bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments over 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO, on average +2.0 on Llama-3.2-3B-Instruct, +1.2 on Qwen2.5-7B-Instruct and +1.3 on Qwen3-4B-Instruct. The code is available at https://github.com/BaohaoLiao/SAGE.
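The collapse described above can be illustrated with a minimal sketch. It assumes the commonly used GRPO normalization (subtract the group mean reward, divide by the group standard deviation) and a binary verifier reward; the exact variant used in SAGE may differ. The function names `group_relative_advantages` and `prob_degenerate_group`, and the `eps` constant, are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt.

    Assumption: the usual GRPO form (r - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def prob_degenerate_group(p, group_size):
    """Probability that every rollout in a group gets the same binary reward,
    assuming independent rollouts that each succeed with probability p."""
    return p ** group_size + (1 - p) ** group_size


# All rollouts fail: every advantage is zero, so the update vanishes.
collapsed = group_relative_advantages([0, 0, 0, 0])

# One rollout succeeds: advantages are nonzero and carry a learning signal.
mixed = group_relative_advantages([0, 1, 0, 0])

# On a hard prompt (1% success rate) with 8 rollouts, most groups are degenerate.
hard_prompt = prob_degenerate_group(0.01, 8)
```

Under these assumptions, anything that moves the per-rollout success rate away from 0 or 1 during training, such as conditioning on a hint, lowers the chance of a degenerate group while leaving the reward itself untouched.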

Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | AIME 2025 | Accuracy: 49.1 | 214 |
| Mathematical Reasoning | IMO-Bench | Accuracy: 32.5 | 57 |
| Mathematical Reasoning | AIME 2026 | AIME 2026 Accuracy: 57.8 | 55 |
| Math Reasoning | AMC23 | -- | 51 |
| Mathematical Reasoning | In-Distribution Avg | Average Score: 42.3 | 29 |
| Math Reasoning | AIME 24 | Average@16: 16 | 26 |
| Math Reasoning | AIME25 | Average@16: 12.5 | 26 |
| Math Reasoning | OlympiadBench | Pass@16: 45.9 | 18 |
| Mathematical Reasoning | Mathematical Reasoning Suite Overall | Average Score: 48.2 | 16 |
| Mathematical Reasoning | HMMT 2026 | Accuracy: 29.9 | 16 |
Showing 10 of 17 rows
