Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't
About
Enhancing the reasoning capabilities of large language models (LLMs) typically relies on massive computational resources and extensive datasets, which limits accessibility in resource-constrained settings. Our study investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs, focusing on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy Optimization (GRPO) algorithm and curating a compact, high-quality mathematical reasoning dataset, we conducted three experiments to explore model behavior and performance. Our results demonstrate rapid reasoning gains (e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing o1-preview) using only 7,000 samples and a $42 training cost, compared to thousands of dollars for baseline models. However, challenges such as optimization instability and length constraints emerged with prolonged training. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches. We release our code and datasets as open-source resources, providing insights into trade-offs and laying a foundation for scalable, reasoning-capable LLMs in resource-limited environments. Both are available at https://github.com/knoveleng/open-rs.
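The core idea behind GRPO, scoring each sampled completion relative to the other completions drawn for the same prompt instead of using a learned value function, can be illustrated with a minimal sketch. This is not the authors' adapted implementation: the function name `group_relative_advantages`, the binary reward values, and the use of the population standard deviation with a small `eps` are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for a single prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    The population standard deviation and eps are illustrative choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of one math prompt:
# 1.0 for a correct final answer, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# If every completion gets the same reward, all advantages are zero,
# so that prompt contributes no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Completions that beat the group average receive positive advantages and the rest receive negative ones, which is what lets the method work with a simple rule-based reward on a compact dataset.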
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AIME 2024 | Accuracy | 20 | 479 |
| Mathematical Reasoning | AIME 2025 | Accuracy | 20 | 311 |
| Mathematical Reasoning | Minerva | Pass@1 Accuracy | 27.6 | 289 |
| Mathematical Reasoning | MATH 500 | pass@1 | 93.2 | 239 |
| Mathematical Reasoning | MATH 500 | Pass@1 Rate | 86.2 | 236 |
| Mathematical Reasoning | OlympiadBench | Accuracy | 20.1 | 213 |
| Mathematical Reasoning | AMC23 | PASS@1 Accuracy | 70 | 207 |
| Mathematical Reasoning | AIME 24 | Pass@1 Accuracy | 30 | 128 |
| Mathematical Reasoning | MATH 500 | Accuracy | 69.8 | 116 |
| Mathematical Reasoning | AIME 24 | Pass@1 Accuracy | 26.7 | 103 |
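
Several rows above report a pass@1 metric. As a rough guide to reading those numbers, the sketch below uses the widely used unbiased pass@k estimator, which for k = 1 reduces to the fraction of correct samples. Whether each listed result was computed this way is an assumption; the function name `pass_at_k` and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n sampled answers, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that at least
    one of k answers drawn without replacement from the n is correct.
    """
    if n - c < k:
        # Fewer than k incorrect answers: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem with 10 sampled answers, 3 of them correct.
single = pass_at_k(10, 3, 1)  # equals c / n up to floating-point rounding
triple = pass_at_k(10, 3, 3)  # chance that at least one of 3 draws is correct
```

A benchmark-level score is then the mean of this per-problem value over all problems, usually reported as a percentage.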