
Diversity-Incentivized Exploration for Versatile Reasoning

About

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a crucial paradigm for incentivizing reasoning capabilities in Large Language Models (LLMs). Due to vast state-action spaces and reward sparsity in reasoning tasks, existing methods often struggle with deficient exploration and poor sample efficiency. In this paper, we propose DIVER (Diversity-Incentivized Exploration for VersatilE Reasoning), a framework that highlights the pivotal role of global sequence-level diversity in incentivizing deep exploration for versatile reasoning. We first conduct a primary empirical study that reveals a strong positive correlation between global diversity and reasoning capacity. Building on this insight, we introduce global diversity incentives as an intrinsic reward to promote deep exploration in a semantically structured space. To incorporate the intrinsic reward, we develop a potential-based reward shaping mechanism that preserves optimal policy invariance, and we design simple heuristics to mitigate possible reward hacking. Experimental results show that DIVER outperforms competitive RLVR baselines with various exploration strategies on both in-domain and out-of-domain tasks, excelling in both Pass@1 and Pass@k evaluations. Our code is available at https://github.com/NJU-RL/DIVER.

Zican Hu, Shilin Zhang, Yafu Li, Jianhao Yan, Xuyang Hu, Leyang Cui, Xiaoye Qu, Chunlin Chen, Yu Cheng, Zhi Wang • 2025
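
The abstract mentions potential-based reward shaping as the way the intrinsic reward is added while preserving optimal policy invariance. As a rough illustration of that general idea only, the sketch below applies the textbook shaping term gamma * phi(s') - phi(s) to a toy episode. It is not DIVER's implementation: the function names, the toy states, and the "distinct token count" potential are all hypothetical stand-ins chosen for this example.

```python
from typing import Callable, List, Sequence


def shaped_rewards(
    states: Sequence[object],
    rewards: Sequence[float],
    potential: Callable[[object], float],
    gamma: float = 1.0,
) -> List[float]:
    """Add the shaping term gamma * phi(s') - phi(s) to each step reward.

    states has one more element than rewards: the transition
    states[t] -> states[t + 1] yields rewards[t].
    """
    if len(states) != len(rewards) + 1:
        raise ValueError("expected len(states) == len(rewards) + 1")
    return [
        r + gamma * potential(s_next) - potential(s)
        for r, s, s_next in zip(rewards, states[:-1], states[1:])
    ]


def toy_potential(state) -> float:
    # Hypothetical stand-in for a diversity score: number of distinct tokens.
    return float(len(set(state)))


# Hypothetical toy episode: states are partial token sequences and only the
# final step carries a (verifiable) task reward.
states = [(), ("a",), ("a", "b"), ("a", "b", "b")]
rewards = [0.0, 0.0, 1.0]

shaped = shaped_rewards(states, rewards, toy_potential, gamma=1.0)

# With gamma = 1 the shaping terms telescope, so the shaped return differs
# from the original return only by phi(last state) - phi(first state).
offset = toy_potential(states[-1]) - toy_potential(states[0])
print(shaped, sum(shaped), sum(rewards) + offset)
```

Because the added terms telescope, the shaped return of a trajectory depends on the potential only through its start and end states, which is the intuition behind the invariance property the abstract refers to. How DIVER defines its potential and handles reward hacking is described in the paper and the linked repository, not here.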

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | MATH500 (test) | -- | 895 |
| Mathematical Reasoning | AIME 2024 | Accuracy: 42.3 | 479 |
| Mathematical Reasoning | AIME 2024 | Accuracy: 51.1 | 370 |
| Mathematical Reasoning | AIME 2025 | Accuracy: 44.2 | 311 |
| Science Question Answering | ARC-C | Accuracy: 88.4 | 261 |
| Science Reasoning | GPQA | Accuracy: 59.1 | 243 |
| Mathematical Reasoning | Minerva Math | Accuracy: 36.8 | 233 |
| Commonsense Reasoning | ARC-C | Accuracy: 91.1 | 215 |
| Mathematical Reasoning | In-Distribution Reasoning Performance Suite (AIME, AMC, MATH-500, Minerva, Olympiad) | AIME 2024 Score: 23.8 | 112 |
| Question Answering | MMLU-Pro | Accuracy: 62 | 91 |
Showing 10 of 28 rows
