
Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL

About

Reinforcement Learning from Verifiable Rewards (RLVR) trains large language models (LLMs) from sampled trajectories, making decoding strategy a core component of learning rather than a purely inference-time choice. Sampling temperature directly controls the exploration–exploitation trade-off by modulating policy entropy, yet existing methods rely on static values or heuristic adaptations that are decoupled from task-level rewards. We propose Introspective LLM, a hierarchical reinforcement learning framework that learns to control sampling temperature during generation. At each decoding step, the model selects a temperature based on its hidden state and samples the next token from the resulting distribution. Temperature and token policies are jointly optimized from downstream rewards using a coordinate ascent scheme. Experiments on mathematical reasoning benchmarks show that learned temperature policies outperform fixed and heuristic baselines, while exhibiting interpretable exploration behaviors aligned with reasoning uncertainty.
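The per-step control loop described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: it assumes a small discrete set of temperature actions (`TEMPS`) and uses random projections as stand-ins for the LLM's learned weights; all names (`decode_step`, `W_temp`, etc.) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in dimensions; the real setting uses an LLM's hidden state and vocabulary.
HIDDEN, VOCAB = 16, 32
TEMPS = [0.6, 1.0, 1.4]  # assumed discrete temperature action set

# Random projections playing the role of learned heads.
W_lm = rng.normal(size=(HIDDEN, VOCAB))         # token head -> logits
W_temp = rng.normal(size=(HIDDEN, len(TEMPS)))  # temperature-policy head

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def decode_step(h):
    """One decoding step of the hierarchical policy:
    (1) the high-level policy picks a temperature from the hidden state,
    (2) the low-level policy samples the next token from the rescaled distribution."""
    temp_probs = softmax(h @ W_temp)                      # pi_temp(tau | h)
    tau = TEMPS[rng.choice(len(TEMPS), p=temp_probs)]
    token_probs = softmax((h @ W_lm) / tau)               # temperature-scaled token distribution
    token = int(rng.choice(VOCAB, p=token_probs))
    return token, tau

h = rng.normal(size=HIDDEN)  # placeholder hidden state for one step
token, tau = decode_step(h)
print(token, tau)
```

In the paper's framework both heads would be trained from the downstream task reward (alternating updates in a coordinate-ascent scheme); here the weights are fixed only to show the sampling mechanics.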

Yixiao Zhou, Yang Li, Dongzhou Cheng, Hehe Fan, Yu Cheng • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Code Generation | HumanEval (test) | Pass@1 | 76.83 | 444
Scientific Reasoning | GPQA Diamond (test) | Accuracy | 38.26 | 32
Mathematical Reasoning | OlympiadBench | Avg Success Rate @8 | 44.25 | 23
Mathematical Reasoning | AIME 24 | Avg@8 | 19.17 | 14
Mathematical Reasoning | AMC 23 | Avg@8 | 52.5 | 14
Mathematical Reasoning | MATH 500 | Avg@8 Score | 80.73 | 14
Mathematical Reasoning | Omni-MATH | Average Score @8 | 25.34 | 14
Mathematical Reasoning | Average Reasoning Benchmarks | Avg@8 | 40.91 | 14
Mathematical Reasoning | Minerva Math | Avg@8 | 23.48 | 14
General Knowledge Reasoning | MMLU Pro (test) | Accuracy | 37.72 | 10
