SD-E$^2$: Semantic Exploration for Reasoning Under Token Budgets
About
Small language models (SLMs) struggle with complex reasoning because exploration is expensive under tight compute budgets. We introduce Semantic Diversity-Exploration-Exploitation (SD-E$^2$), a reinforcement learning framework that makes exploration explicit by optimizing semantic diversity in generated reasoning trajectories. Using a frozen sentence-embedding model, SD-E$^2$ assigns a diversity reward that captures (i) the coverage of semantically distinct solution strategies and (ii) their average pairwise dissimilarity in embedding space, rather than surface-form novelty. This diversity reward is combined with outcome correctness and solution efficiency in a z-score-normalized multi-objective formulation that stabilizes training. On GSM8K, SD-E$^2$ surpasses the base Qwen2.5-3B-Instruct and two strong GRPO baselines (GRPO-CFL and GRPO-CFEE) by +27.4, +5.2, and +1.5 percentage points, respectively, while discovering on average 9.8 semantically distinct strategies per question. SD-E$^2$ also improves MedMCQA to 49.64% versus 38.37% for the base model and shows gains on the harder AIME benchmark (1983-2025), reaching 13.28% versus 6.74% for the base model. These results indicate that rewarding semantic novelty yields a more compute-efficient exploration-exploitation signal for training reasoning-capable SLMs. By introducing cognitive adaptation (adjusting the structure of the reasoning process rather than per-token computation), SD-E$^2$ offers a complementary path to efficiency gains in resource-constrained models.
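As a rough illustration of the reward described above, the following Python sketch computes the two diversity components from precomputed trajectory embeddings and combines z-scored reward terms. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the similarity threshold `sim_threshold`, the greedy coverage rule, and the weights are all hypothetical, and the embeddings are assumed to come from some frozen sentence-embedding model that is not shown here.

```python
import numpy as np


def diversity_components(embeddings, sim_threshold=0.8):
    """Return (coverage, mean pairwise dissimilarity) for one question.

    `embeddings` holds one row per sampled reasoning trajectory.
    `sim_threshold` is a hypothetical cutoff: two trajectories whose cosine
    similarity reaches it are treated as the same strategy.
    """
    X = np.asarray(embeddings, dtype=float)
    # Normalize rows so that dot products are cosine similarities.
    norms = np.clip(np.linalg.norm(X, axis=1, keepdims=True), 1e-12, None)
    X = X / norms
    sim = X @ X.T
    n = X.shape[0]

    # (i) Coverage: greedily keep a trajectory only if it is not a
    # near-duplicate of one already kept (an assumed counting rule).
    kept = []
    for i in range(n):
        if all(sim[i, j] < sim_threshold for j in kept):
            kept.append(i)
    coverage = len(kept) / n

    # (ii) Average pairwise dissimilarity (1 - cosine) over distinct pairs.
    if n > 1:
        upper = np.triu_indices(n, k=1)
        dissimilarity = float(np.mean(1.0 - sim[upper]))
    else:
        dissimilarity = 0.0
    return coverage, dissimilarity


def zscore(values, eps=1e-8):
    """Standardize values to zero mean and (approximately) unit variance."""
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / (v.std() + eps)


def combined_reward(correctness, efficiency, diversity, weights=(1.0, 0.5, 0.5)):
    """Weighted sum of z-scored reward terms; the weights are placeholders."""
    w_c, w_e, w_d = weights
    return w_c * zscore(correctness) + w_e * zscore(efficiency) + w_d * zscore(diversity)
```

Standardizing each term before summing keeps any single component from dominating merely because of its scale, which is one plausible reading of why the normalization stabilizes training.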
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | EM | 86.63 | 115 |
| Mathematical Reasoning | MATH | Pass@1 | 71.85 | 112 |
| Scientific Reasoning | GPQA | Pass@1 | 48.12 | 22 |
| Mathematical Reasoning | AIME | Pass@1 | 54.42 | 20 |
| Mathematical Reasoning | AIME | Pass@1 | 54.42 | 18 |
| Mathematical Reasoning | GSM8K | Pass@1 | 83.25 | 9 |
| Mathematical Reasoning | MATH | Pass@1 | 71.85 | 9 |
| Mathematical Reasoning | AIME 1983-2025 (combined) | Accuracy | 13.28 | 5 |
| Graduate-Level Reasoning | GPQA | Pass@1 | 48.12 | 5 |
| Mathematical Reasoning | GSM8K | ACC | 82.03 | 4 |