SD-E$^2$: Semantic Exploration for Reasoning Under Token Budgets

About

Small language models (SLMs) struggle with complex reasoning because exploration is expensive under tight compute budgets. We introduce Semantic Diversity-Exploration-Exploitation (SD-E$^2$), a reinforcement learning framework that makes exploration explicit by optimizing semantic diversity in generated reasoning trajectories. Using a frozen sentence-embedding model, SD-E$^2$ assigns a diversity reward that captures (i) the coverage of semantically distinct solution strategies and (ii) their average pairwise dissimilarity in embedding space, rather than surface-form novelty. This diversity reward is combined with outcome correctness and solution efficiency in a multi-objective formulation whose z-score normalization stabilizes training. On GSM8K, SD-E$^2$ surpasses the base Qwen2.5-3B-Instruct and strong GRPO baselines (GRPO-CFL and GRPO-CFEE) by +27.4, +5.2, and +1.5 percentage points, respectively, while discovering on average 9.8 semantically distinct strategies per question. We further improve the MedMCQA score to 49.64% versus 38.37% for the base model and show gains on the harder AIME benchmark (1983-2025), reaching 13.28% versus 6.74% for the base model. These results indicate that rewarding semantic novelty yields a more compute-efficient exploration-exploitation signal for training reasoning-capable SLMs. By introducing cognitive adaptation, that is, adjusting the structure of the reasoning process rather than per-token computation, SD-E$^2$ offers a complementary path to efficiency gains in resource-constrained models.
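
The abstract names the ingredients of the reward but not its exact formulas, so the following is only a minimal sketch of how such a signal could be computed, not the authors' implementation. It assumes the trajectory embeddings have already been produced by some frozen sentence-embedding model and are passed in as a NumPy array. The greedy cosine-similarity threshold used to count "distinct" strategies, the value of `sim_threshold`, the `weights`, and all function names are hypothetical choices made for illustration.

```python
import numpy as np


def _unit_rows(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def count_distinct_strategies(embeddings, sim_threshold=0.8):
    """Coverage proxy (assumed rule): greedily keep a trajectory as a new
    strategy if its cosine similarity to every kept one is below the threshold."""
    representatives = []
    for row in _unit_rows(embeddings):
        if all(float(row @ rep) < sim_threshold for rep in representatives):
            representatives.append(row)
    return len(representatives)


def mean_pairwise_dissimilarity(embeddings):
    """Average of (1 - cosine similarity) over all unordered pairs."""
    unit = _unit_rows(embeddings)
    n = unit.shape[0]
    if n < 2:
        return 0.0
    sims = unit @ unit.T
    upper = np.triu_indices(n, k=1)  # each pair once, diagonal excluded
    return float(np.mean(1.0 - sims[upper]))


def zscore(values, eps=1e-8):
    """Standardize scores across a batch; constant inputs map to zeros."""
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / (v.std() + eps)


def combined_rewards(correct, efficiency, diversity, weights=(1.0, 0.5, 0.5)):
    """Weighted sum of z-scored components (the weights are illustrative)."""
    parts = (zscore(correct), zscore(efficiency), zscore(diversity))
    return sum(w * p for w, p in zip(weights, parts))


# Toy example: two near-identical trajectories and one orthogonal to both.
toy_embeddings = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])
n_strategies = count_distinct_strategies(toy_embeddings)
avg_dissim = mean_pairwise_dissimilarity(toy_embeddings)
rewards = combined_rewards([1, 0, 1], [0.5, 0.2, 0.9], [0.3, 0.3, 0.3])
```

Standardizing each component before summing keeps a term with a large raw scale, such as a token-count-based efficiency score, from dominating the others, which is one plausible reading of why the z-score normalization is said to stabilize training.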

Kshitij Mishra, Nils Lukas, Salem Lahlou • 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	EM 86.63	123
Mathematical Reasoning	MATH	Pass@1 71.85	112
Mathematical Reasoning	AIME	Pass@1 54.42	44
Scientific Reasoning	GPQA	Pass@1 48.12	22
Mathematical Reasoning	AIME	Pass@1 54.42	18
Mathematical Reasoning	GSM8K	Pass@1 83.25	9
Mathematical Reasoning	MATH	Pass@1 71.85	9
Mathematical Reasoning	AIME 1983-2025 (combined)	Accuracy 13.28	5
Graduate-Level Reasoning	GPQA	Pass@1 48.12	5
Mathematical Reasoning	GSM8K	ACC 82.03	4

