
SD-E$^2$: Semantic Exploration for Reasoning Under Token Budgets

About

Small language models (SLMs) struggle with complex reasoning because exploration is expensive under tight compute budgets. We introduce Semantic Diversity-Exploration-Exploitation (SD-E$^2$), a reinforcement learning framework that makes exploration explicit by optimizing semantic diversity in generated reasoning trajectories. Using a frozen sentence-embedding model, SD-E$^2$ assigns a diversity reward that captures (i) the coverage of semantically distinct solution strategies and (ii) their average pairwise dissimilarity in embedding space, rather than surface-form novelty. This diversity reward is combined with outcome correctness and solution efficiency in a z-score-normalized multi-objective formulation that stabilizes training. On GSM8K, SD-E$^2$ surpasses the base Qwen2.5-3B-Instruct and strong GRPO baselines (GRPO-CFL and GRPO-CFEE) by +27.4, +5.2, and +1.5 percentage points, respectively, while discovering on average 9.8 semantically distinct strategies per question. On MedMCQA, SD-E$^2$ reaches 49.64% versus 38.37% for the base model, and on the harder AIME benchmark (1983-2025) it reaches 13.28% versus 6.74% for the base model. These results indicate that rewarding semantic novelty yields a more compute-efficient exploration-exploitation signal for training reasoning-capable SLMs. By introducing cognitive adaptation (adjusting the structure of the reasoning process rather than per-token computation), SD-E$^2$ offers a complementary path to efficiency gains in resource-constrained models.
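The abstract describes the reward only at a high level, so the following is a minimal sketch of how such a signal might be computed, not the authors' implementation. It assumes trajectory embeddings are already available as vectors from some frozen sentence-embedding model. The function names, the greedy near-duplicate rule, the `sim_threshold` value, and the weights are all hypothetical choices made for illustration.

```python
import numpy as np


def diversity_reward(embeddings, sim_threshold=0.8):
    """Return (coverage, mean_dissimilarity) for one question's trajectories.

    Assumption: `embeddings` has shape (n_trajectories, dim) and comes from a
    frozen sentence-embedding model. `sim_threshold` is a hypothetical cutoff
    above which two trajectories count as the same strategy.
    """
    emb = np.asarray(embeddings, dtype=float)
    n = emb.shape[0]
    # Unit-normalize rows so that dot products are cosine similarities.
    norms = np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)
    emb = emb / norms
    sim = emb @ emb.T

    # (i) Coverage: greedily keep a trajectory as a new "strategy" only if it
    # is not a near-duplicate of any representative kept so far.
    representatives = []
    for i in range(n):
        if all(sim[i, j] < sim_threshold for j in representatives):
            representatives.append(i)
    coverage = len(representatives)

    # (ii) Average pairwise cosine dissimilarity over all distinct pairs.
    if n < 2:
        return coverage, 0.0
    upper = np.triu_indices(n, k=1)
    mean_dissimilarity = float(np.mean(1.0 - sim[upper]))
    return coverage, mean_dissimilarity


def zscore(values, eps=1e-8):
    """Standardize a batch of reward values to zero mean and unit variance."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / (values.std() + eps)


def combined_reward(correctness, efficiency, diversity, weights=(1.0, 0.5, 0.5)):
    """Weighted sum of z-scored reward components (weights are illustrative)."""
    w_c, w_e, w_d = weights
    return (w_c * zscore(correctness)
            + w_e * zscore(efficiency)
            + w_d * zscore(diversity))


# Toy usage: two identical trajectories and one orthogonal to both.
coverage, dissim = diversity_reward([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
batch = combined_reward(
    correctness=[1, 0, 1, 0],
    efficiency=[0.5, 0.5, 0.5, 0.5],
    diversity=[0.2, 0.2, 0.8, 0.8],
)
```

Standardizing each component before summing puts correctness, efficiency, and diversity on a comparable scale within a batch, which is one plausible reading of how z-score normalization could keep a single term from dominating the update.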

Kshitij Mishra, Nils Lukas, Salem Lahlou • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | GSM8K | EM 86.63 | 115 |
| Mathematical Reasoning | MATH | Pass@1 71.85 | 112 |
| Scientific Reasoning | GPQA | Pass@1 48.12 | 22 |
| Mathematical Reasoning | AIME | Pass@1 54.42 | 20 |
| Mathematical Reasoning | AIME | Pass@1 54.42 | 18 |
| Mathematical Reasoning | GSM8K | Pass@1 83.25 | 9 |
| Mathematical Reasoning | MATH | Pass@1 71.85 | 9 |
| Mathematical Reasoning | AIME 1983-2025 (combined) | Accuracy 13.28 | 5 |
| Graduate-Level Reasoning | GPQA | Pass@1 48.12 | 5 |
| Mathematical Reasoning | GSM8K | ACC 82.03 | 4 |
Showing 10 of 11 rows
