
Self-Hinting Language Models Enhance Reinforcement Learning

About

Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self-hint aligned GRPO with privileged supervision (SAGE), an on-policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt $x$, the model samples a compact hint $h$ (e.g., a plan or decomposition) and then generates a solution $\tau$ conditioned on $(x,h)$. Crucially, the task reward $R(x,\tau)$ is unchanged; hints only increase within-group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set $h=\varnothing$ and deploy the no-hint policy without any privileged information. Moreover, sampling diverse self-hints serves as an adaptive curriculum that tracks the learner's bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments on six benchmarks with three LLMs show that SAGE consistently outperforms GRPO: on average +2.0 points on Llama-3.2-3B-Instruct, +1.2 on Qwen2.5-7B-Instruct, and +1.3 on Qwen3-4B-Instruct. The code is available at https://github.com/BaohaoLiao/SAGE.
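The advantage-collapse mechanism described above can be sketched numerically. The snippet below is a minimal illustration (not the authors' implementation; function and variable names are hypothetical): GRPO standardizes rewards within a rollout group, so a group whose rollouts all receive the same sparse terminal reward yields zero advantages everywhere, while hint-induced outcome diversity restores a non-zero signal under the unchanged reward.

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within a rollout group.

    If every rollout in the group receives the same terminal reward,
    (r - mean) is zero for all of them and the policy update vanishes.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]


# A hard prompt under plain GRPO: all four rollouts fail the verifier,
# so the group reward is uniformly 0 and every advantage is 0.
adv_flat = grpo_advantages([0.0, 0.0, 0.0, 0.0])

# Same prompt with sampled self-hints: some hints let the model reach a
# verified solution, so rewards differ within the group and the
# standardized advantages are non-zero.
adv_hinted = grpo_advantages([0.0, 1.0, 0.0, 1.0])
```

Note that the verifier reward itself is never modified; only the sampled conditioning (the hint) changes, which is what reshapes the rollout distribution.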

Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Math Reasoning | AMC23 | - | - | 51 |
| Mathematical Reasoning | In-Distribution Avg | Average Score | 42.3 | 29 |
| Math Reasoning | AIME 24 | Average@16 | 16 | 26 |
| Math Reasoning | AIME 25 | Average@16 | 12.5 | 26 |
| Math Reasoning | OlympiadBench | Pass@16 | 45.9 | 18 |
| Generalization Reasoning | MMLU-Pro | Average@16 Accuracy | 59.3 | 14 |
| Generalization Reasoning | GPQA Diamond | Average@16 Accuracy | 38 | 14 |
| Generalization Reasoning | Out-of-Distribution Aggregate | Average Accuracy | 48.6 | 14 |
| Math Reasoning | MATH 500 | Average Accuracy@16 | 80 | 14 |
| Math Reasoning | Minerva Math | Average Accuracy@16 | 39.3 | 14 |

Other info

GitHub: https://github.com/BaohaoLiao/SAGE
