ToRL: Scaling Tool-Integrated RL
About
We introduce ToRL (Tool-Integrated Reinforcement Learning), a framework for training large language models (LLMs) to autonomously use computational tools via reinforcement learning. Unlike supervised fine-tuning, ToRL allows models to explore and discover optimal strategies for tool use. Experiments with Qwen2.5-Math models show significant improvements: ToRL-7B reaches 43.3% accuracy on AIME 24, surpassing reinforcement learning without tool integration by 14% and the best existing Tool-Integrated Reasoning (TIR) model by 17%. Further analysis reveals emergent behaviors such as strategic tool invocation, self-regulation of ineffective code, and dynamic adaptation between computational and analytical reasoning, all arising purely through reward-driven learning.
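To make the idea of tool-integrated rollouts concrete, here is a minimal sketch of how generation and code execution might be interleaved, with an outcome-based reward on the final answer. Everything in it is an assumption for illustration: the `<code>`/`<output>` delimiters, the `Answer:` format, the reward values, and the names `run_tool`, `rollout`, `outcome_reward`, and `fake_generate` are hypothetical and are not taken from the ToRL implementation.

```python
import contextlib
import io
import re

# Hypothetical delimiters; the abstract does not specify the actual format.
CODE_PATTERN = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_tool(code: str) -> str:
    """Execute a code snippet and return its stdout, or an error message."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # illustration only: a real system needs a sandbox
    except Exception as exc:
        return f"Error: {type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def rollout(generate, prompt: str, max_tool_calls: int = 3) -> str:
    """Alternate generation and tool execution until no code is emitted."""
    context = prompt
    for _ in range(max_tool_calls):
        segment = generate(context)
        context += segment
        match = CODE_PATTERN.search(segment)
        if match is None:
            break  # the model answered without calling the tool
        # Feed the interpreter result back so the model can reason over it.
        context += f"\n<output>{run_tool(match.group(1))}</output>\n"
    return context


def outcome_reward(trajectory: str, reference: str) -> float:
    """Assumed rule-based reward: +1 if the last 'Answer: X' matches, else -1."""
    found = re.findall(r"Answer:\s*(\S+)", trajectory)
    return 1.0 if found and found[-1] == reference else -1.0


def fake_generate(context: str) -> str:
    """Stand-in for a policy model: writes code first, then answers."""
    if "<output>" not in context:
        return "<code>print(sum(range(1, 11)))</code>"
    return "Answer: 55"


trajectory = rollout(fake_generate, "What is 1 + 2 + ... + 10? ")
reward = outcome_reward(trajectory, "55")
```

In a reinforcement learning setup, `fake_generate` would be replaced by sampling from the policy model, and `reward` would feed the policy update. The sketch leaves out sandboxing, timeouts, and batching, all of which a real training loop would need.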
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH | Accuracy | 87.8 | 535 |
| Mathematical Reasoning | AIME 25 | Accuracy | 27.9 | 201 |
| Mathematical Reasoning | AMC 23 | Accuracy | 45 | 198 |
| Mathematical Reasoning | AIME24 | Accuracy | 74 | 130 |
| Mathematical Reasoning | GSM8K | -- | -- | 102 |
| Mathematical Reasoning | AIME 24 | AIME 24 Accuracy | 23.33 | 84 |
| Scientific Question Answering | GPQA Diamond | Accuracy | 51.5 | 64 |
| Expert-Level Question Answering | GPQA Diamond | Pass@1 | 65.15 | 39 |
| Knowledge-intensive reasoning | MuSiQue | Accuracy | 72 | 31 |
| Function Calling | BFCL (Berkeley Function Calling Leaderboard) | Base Score | 0.00e+0 | 28 |