Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL
About
Reinforcement Learning from Verifiable Rewards (RLVR) trains large language models (LLMs) from sampled trajectories, making decoding strategy a core component of learning rather than a purely inference-time choice. Sampling temperature directly controls the exploration–exploitation trade-off by modulating policy entropy, yet existing methods rely on static values or heuristic adaptations that are decoupled from task-level rewards. We propose Introspective LLM, a hierarchical reinforcement learning framework that learns to control sampling temperature during generation. At each decoding step, the model selects a temperature based on its hidden state and samples the next token from the resulting distribution. Temperature and token policies are jointly optimized from downstream rewards using a coordinate ascent scheme. Experiments on mathematical reasoning benchmarks show that learned temperature policies outperform fixed and heuristic baselines, while exhibiting interpretable exploration behaviors aligned with reasoning uncertainty.
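The per-step decoding loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the discrete temperature grid, the linear heads (`W_temp`, `W_vocab`), and all dimensions are assumptions chosen for readability. The key structure is the hierarchy: a high-level policy maps the hidden state to a temperature, and the low-level token policy samples from the temperature-scaled distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete temperature grid (an assumption; the paper's
# actual temperature action space is not specified here).
TEMPS = np.array([0.3, 0.7, 1.0, 1.3])


def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()


def temperature_policy(hidden, W_temp):
    """High-level policy: map the hidden state to a distribution over
    candidate temperatures and sample one."""
    probs = softmax(hidden @ W_temp)
    idx = rng.choice(len(TEMPS), p=probs)
    return TEMPS[idx]


def decode_step(hidden, W_vocab, W_temp):
    """One decoding step: choose a temperature from the hidden state,
    then sample the next token from the temperature-scaled logits."""
    T = temperature_policy(hidden, W_temp)
    logits = hidden @ W_vocab           # low-level token logits
    token_probs = softmax(logits / T)   # entropy modulated by T
    token = rng.choice(len(token_probs), p=token_probs)
    return token, T


# Toy dimensions: hidden size 8, vocabulary size 16.
hidden = rng.standard_normal(8)
W_temp = rng.standard_normal((8, len(TEMPS)))
W_vocab = rng.standard_normal((8, 16))
token, T = decode_step(hidden, W_vocab, W_temp)
```

In the full method, both heads would be trained from the downstream verifiable reward (here they are random), alternating updates between the temperature and token policies in the coordinate ascent scheme the abstract describes.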
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval (test) | Pass@1 | 76.83 | 444 |
| Scientific Reasoning | GPQA Diamond (test) | Accuracy | 38.26 | 32 |
| Mathematical Reasoning | OlympiadBench | Avg Success Rate @8 | 44.25 | 23 |
| Mathematical Reasoning | AIME 24 | Avg@8 | 19.17 | 14 |
| Mathematical Reasoning | AMC 23 | Avg@8 | 52.5 | 14 |
| Mathematical Reasoning | MATH 500 | Avg@8 Score | 80.73 | 14 |
| Mathematical Reasoning | Omni-MATH | Average Score @8 | 25.34 | 14 |
| Mathematical Reasoning | Average Reasoning Benchmarks | Avg@8 | 40.91 | 14 |
| Mathematical Reasoning | Minerva Math | Avg@8 | 23.48 | 14 |
| General Knowledge Reasoning | MMLU Pro (test) | Accuracy | 37.72 | 10 |