Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective

About

Reinforcement Learning with Verifiable Rewards (RLVR) is a cornerstone technique for enhancing the reasoning capabilities of Large Language Models (LLMs). However, its training is often plagued by entropy collapse, a rapid decline in policy entropy that limits exploration and undermines training effectiveness. While recent works attempt to mitigate this issue through heuristic entropy interventions, the underlying mechanisms remain poorly understood. In this work, we conduct comprehensive theoretical and empirical analyses of entropy dynamics in RLVR, offering two main insights: (1) we derive a tight analytical approximation for the token-level entropy change at each update step, revealing four governing factors and providing a unified theoretical framework that explains how existing methods influence entropy; (2) we reveal a fundamental limitation of recent approaches: they rely on heuristic adjustments to only one or two of these factors and leave the others unconsidered, which inherently limits their effectiveness. Motivated by these findings, we propose STEER, a principled entropy-modulation method that adaptively reweights tokens based on theoretically estimated entropy variations. Extensive experiments across six mathematical reasoning and three coding benchmarks demonstrate that STEER effectively mitigates entropy collapse and consistently outperforms state-of-the-art baselines.
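To make the quantity behind "entropy collapse" concrete, the following minimal sketch computes the Shannon entropy of a single next-token distribution from its logits and averages it over a sequence. It is a generic illustration only, not the paper's entropy-change approximation and not STEER; the function names (`softmax`, `token_entropy`, `mean_policy_entropy`) and the toy logits are assumptions introduced here.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of one next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_policy_entropy(per_token_logits):
    """Average token-level entropy over a sequence of logit vectors."""
    entropies = [token_entropy(l) for l in per_token_logits]
    return sum(entropies) / len(entropies)

# Toy 4-token vocabulary (hypothetical values).
uniform = [0.0, 0.0, 0.0, 0.0]   # maximally exploratory distribution
peaked = [10.0, 0.0, 0.0, 0.0]   # nearly deterministic distribution

print(token_entropy(uniform))    # equals log(4) for a uniform distribution
print(token_entropy(peaked))     # close to zero
print(mean_policy_entropy([uniform, peaked]))
```

In this toy setting, a policy whose per-token distributions drift from the uniform case toward the peaked case shows a falling mean entropy, which is the kind of decline the abstract refers to as entropy collapse.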

Zhezheng Hao, Hong Wang, Haoyang Liu, Jian Luo, Jiarui Yu, Hande Dong, Qiang Lin, Can Wang, Jiawei Chen • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Mathematical Reasoning	MATH 500	Accuracy	82.4	589
Mathematical Reasoning	AIME 2024	Accuracy	36.9	394
Mathematical Reasoning	AIME 2025	Accuracy	16.2	378
Mathematical Reasoning	Minerva	Accuracy (Acc)	28.2	146
Mathematical Reasoning	Olympiad	Accuracy	0.366	136
Mathematical Reasoning	OlympiadBench	Accuracy	43.3	134
Mathematical Reasoning	Minerva Math	pass@1 Accuracy	41.7	104
Mathematical Reasoning	AIME 24	Accuracy	17.4	78
Mathematical Reasoning	AIME 25	Avg@32	16.1	50
Code Generation	LCB v5	Accuracy	31.8	45

Showing 10 of 16 rows
