
Reasoning with Exploration: An Entropy Perspective

About

Balancing exploration and exploitation is a central goal in reinforcement learning (RL). Despite recent advances in enhancing large language model (LLM) reasoning, most methods lean toward exploitation, and increasingly encounter performance plateaus. In this work, we revisit entropy -- a signal of exploration in RL -- and examine its relationship to exploratory reasoning in LLMs. Through empirical analysis, we uncover positive correlations between high-entropy regions and three types of exploratory reasoning actions: (1) pivotal tokens that determine or connect logical steps, (2) reflective actions such as self-verification and correction, and (3) rare behaviors under-explored by the base LLMs. Motivated by this, we introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term. Unlike traditional maximum-entropy methods which encourage exploration by promoting uncertainty, we encourage exploration by promoting longer and deeper reasoning chains. Notably, our method achieves significant gains on the Pass@K metric -- an upper-bound estimator of LLM reasoning capabilities -- even when evaluated with extremely large K values, pushing the boundaries of LLM reasoning.
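The "one line of code" change described in the abstract can be pictured roughly as follows. This is a minimal sketch, not the authors' implementation: the page does not give the exact form of the entropy term, so the clipped bonus, the function names `token_entropy` and `shape_advantages`, and the parameters `alpha` and `kappa` (with their default values) are all assumptions made for illustration.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def shape_advantages(advantages, entropies, alpha=0.4, kappa=2.0):
    """Add an entropy-based bonus to each per-token advantage.

    alpha (bonus scale) and kappa (clip factor, assumed > 1) are hypothetical
    hyperparameters. In an autodiff framework the entropies would be treated
    as constants here, so no gradient flows through the bonus itself.
    """
    shaped = []
    for adv, ent in zip(advantages, entropies):
        # Assumed form: scale the entropy, then cap it at |adv| / kappa.
        bonus = min(alpha * ent, abs(adv) / kappa)
        shaped.append(adv + bonus)  # the single added term
    return shaped


# Example: three tokens with equal entropy but different advantages.
entropies = [token_entropy([0.0, 0.0, 0.0, 0.0])] * 3
print(shape_advantages([1.0, -1.0, 0.0], entropies))
```

Under these assumptions the cap keeps the bonus smaller than the magnitude of the original advantage, so shaping never flips an advantage's sign, and a token with zero advantage receives no bonus. Because the entropy is used as a fixed weight rather than as an objective, the sketch reweights updates toward high-entropy tokens instead of directly maximizing uncertainty, which matches the contrast the abstract draws with maximum-entropy methods.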

Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, Furu Wei • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | MATH500 (test) | -- | 895 |
| Mathematical Reasoning | MATH 500 | Accuracy (Acc): 88.35 | 543 |
| Mathematical Reasoning | GSM8K | -- | 499 |
| Mathematical Reasoning | AIME 2024 | Accuracy: 46.35 | 479 |
| Mathematical Reasoning | MATH 500 | Top-1 Accuracy: 83.4 | 384 |
| Mathematical Reasoning | AIME 2024 | Accuracy: 51.6 | 370 |
| Mathematical Reasoning | AMC | Accuracy (%): 87.04 | 368 |
| Mathematical Reasoning | AIME 2025 | Accuracy: 39.22 | 311 |
| Mathematical Reasoning | Minerva | Pass@1 Accuracy: 48.52 | 289 |
| Mathematical Reasoning | AIME 2024 | Accuracy: 51.6 | 220 |
Showing 10 of 132 rows
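
The abstract reports gains on Pass@K, and one row above lists a Pass@1 figure. This page does not say how those numbers were computed. As a point of reference only, the sketch below shows the commonly used unbiased Pass@K estimator, which draws n samples per problem, counts the c correct ones, and estimates the chance that at least one of k randomly chosen samples is correct. The function name `pass_at_k` and its argument names are illustrative, and the choice of this estimator is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@K estimate for one problem.

    n: total samples drawn, c: number of correct samples, k: budget (k <= n).
    Returns 1 - C(n - c, k) / C(n, k), the probability that a random
    size-k subset of the n samples contains at least one correct sample.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 4 samples, 1 correct.
print(pass_at_k(4, 1, 1))
print(pass_at_k(4, 1, 2))
```

A benchmark-level score would then be the mean of this per-problem estimate over all problems. With k = 1 the estimate reduces to c / n, the plain fraction of correct samples.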
