ReLaX: Reasoning with Latent Exploration for Large Reasoning Models

About

Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated remarkable potential in enhancing the reasoning capability of Large Reasoning Models (LRMs). However, RLVR often drives the policy toward over-determinism, resulting in ineffective exploration and premature policy convergence. While promoting token-level diversity has shown promise in mitigating entropy collapse, we argue that the latent dynamics underlying token generation encode a far richer computational structure for steering policy optimization toward a more effective exploration-exploitation tradeoff. To enable tractable analysis of, and intervention in, the latent dynamics of LRMs, we leverage Koopman operator theory to obtain a linearized representation of their hidden state dynamics. This enables us to introduce Dynamic Spectral Dispersion (DSD), a new metric that quantifies the heterogeneity of the model's latent dynamics and serves as a direct indicator of policy exploration. Building upon these foundations, we propose Reasoning with Latent eXploration (ReLaX), a framework that explicitly incorporates latent dynamics to regulate exploration and exploitation during policy optimization. Comprehensive experiments across a wide range of multimodal and text-only reasoning benchmarks show that ReLaX consistently incentivizes reasoning capability and outperforms existing token-level methods. Our project is available at https://github.com/ZhangShimin1/ReLaX.
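The abstract does not spell out how the Koopman linearization or DSD is computed, so the following is only an illustrative sketch of the general idea. It fits a linear operator to a sequence of hidden states with dynamic mode decomposition (DMD), a standard finite-dimensional approximation of the Koopman operator, and then summarizes how spread out the operator's eigenvalues are. The function names `fit_linear_operator` and `eigenvalue_dispersion`, the `rank` parameter, and the dispersion formula itself are assumptions made for illustration; they are not the paper's definition of DSD and not the API of the ReLaX repository.

```python
import numpy as np


def fit_linear_operator(states: np.ndarray, rank: int) -> np.ndarray:
    """Fit a reduced operator K with x_{t+1} ~= K x_t via DMD.

    states: array of shape (T, d), one hidden state per step.
    rank:   maximum number of SVD modes to keep (hypothetical knob).
    Returns an (r, r) matrix expressed in the leading SVD basis.
    """
    X = states[:-1].T  # (d, T-1) snapshots at time t
    Y = states[1:].T   # (d, T-1) snapshots at time t+1
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    # Drop numerically zero modes so the inverse below stays stable.
    r = min(rank, int(np.sum(s > 1e-10 * s[0])))
    U_r, s_r, V_r = U[:, :r], s[:r], Vh[:r].conj().T
    # K_tilde = U_r^T Y V_r diag(1/s_r); broadcasting scales each column.
    return (U_r.T @ Y @ V_r) / s_r


def eigenvalue_dispersion(K: np.ndarray) -> float:
    """Mean squared distance of eigenvalues from their centroid.

    A hypothetical stand-in for a spectral heterogeneity score,
    NOT the DSD metric defined in the paper.
    """
    eigvals = np.linalg.eigvals(K)
    return float(np.mean(np.abs(eigvals - eigvals.mean()) ** 2))


if __name__ == "__main__":
    # Toy trajectory from a known linear system with two distinct modes.
    A = np.diag([0.9, 0.5])
    x = np.array([1.0, 1.0])
    trajectory = [x]
    for _ in range(19):
        x = A @ x
        trajectory.append(x)
    K = fit_linear_operator(np.array(trajectory), rank=2)
    print("dispersion:", eigenvalue_dispersion(K))
```

Under this toy definition, a trajectory whose modes all evolve at the same rate scores zero, while modes with different rates score higher, which is the intuition behind reading spectral spread as a signal of heterogeneous dynamics.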

Shimin Zhang, Xianwei Chen, Yufan Shen, Ziyuan Ye, Jibin Wu • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Reasoning | MMMU (val) | Accuracy | 57.4 | 144 |
| Multimodal Reasoning | MMStar | Accuracy | 65.5 | 143 |
| Multimodal Reasoning | DynaMath | Accuracy | 55.9 | 58 |
| Multimodal Mathematical Reasoning | MathVista mini (test) | Overall Accuracy | 77.1 | 48 |
| Multi-modal Reasoning | MathVision (test) | Accuracy (%) | 30.2 | 45 |
| Multi-modal Reasoning | EMMA | Accuracy | 30.6 | 26 |
| Multimodal Reasoning | MathVerse (testmini) | Mean@1 Accuracy | 55.7 | 17 |
