To Code or not to Code? Adaptive Tool Integration for Math Language Models via Expectation-Maximization
About
Recent advances in mathematical problem-solving with language models (LMs) integrate chain-of-thought (CoT) reasoning and code execution to harness their complementary strengths. However, existing hybrid frameworks exhibit a critical limitation: they depend on externally dictated instructions or rigid code-integration templates, lacking metacognitive awareness -- the capacity to dynamically evaluate intrinsic capabilities and autonomously determine when and how to integrate tools. This rigidity motivates our study of autonomous code integration, enabling models to adapt their tool-usage strategies as their reasoning abilities evolve during training. While reinforcement learning (RL) shows promise for boosting LLM reasoning at scale (e.g., DeepSeek-R1), we demonstrate its inefficiency in learning autonomous code integration due to inadequate exploration of the vast combinatorial space of CoT-code interleaving patterns. To address this challenge, we propose a novel Expectation-Maximization (EM) framework that synergizes structured exploration (E-step) with off-policy RL optimization (M-step), creating a self-reinforcing cycle between metacognitive tool-use decisions and evolving capabilities. Experiments show that our method achieves superior results through improved exploration. Notably, our 7B model improves by over 11% on MATH500 and 9.4% on AIME without o1-like CoT.
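The EM cycle described above can be illustrated with a minimal toy sketch. This is not the paper's implementation: real trajectories would be model-generated CoT/code interleavings scored by code execution, and the M-step would be an off-policy RL update of the LM. Here, each problem's outcome is simulated with assumed success probabilities, the policy is just a per-problem probability of choosing code integration, and the M-step is a reward-weighted re-estimate standing in for the RL optimization. All names and numbers are illustrative.

```python
import random

random.seed(0)

# Toy problems: each has a hidden "best mode" -- plain CoT or code-integrated
# reasoning. Success probabilities are assumptions for the simulation.
PROBLEMS = [
    {"id": i, "p_success": {"cot": 0.8 if i % 2 else 0.2,
                            "code": 0.3 if i % 2 else 0.9}}
    for i in range(20)
]

# Metacognitive policy: per-problem probability of choosing code integration.
policy = {p["id"]: 0.5 for p in PROBLEMS}


def e_step(policy, samples=16, explore=0.5):
    """E-step: collect trajectories. With probability `explore`, a structured
    hint forces a mode (exploration of both interleaving choices); otherwise
    the current policy picks. Only successful trajectories are kept."""
    data = []
    for prob in PROBLEMS:
        for k in range(samples):
            if random.random() < explore:
                mode = "code" if k % 2 == 0 else "cot"  # hint-forced mode
            else:
                mode = "code" if random.random() < policy[prob["id"]] else "cot"
            if random.random() < prob["p_success"][mode]:
                data.append((prob["id"], mode))
    return data


def m_step(data, smoothing=1.0):
    """M-step: re-estimate the code-use probability from the successful
    (reward-weighted) trajectories -- a stand-in for off-policy RL."""
    new_policy = {}
    for prob in PROBLEMS:
        wins = [m for pid, m in data if pid == prob["id"]]
        code_wins = sum(1 for m in wins if m == "code")
        new_policy[prob["id"]] = (code_wins + smoothing) / (len(wins) + 2 * smoothing)
    return new_policy


# Self-reinforcing EM cycle: tool-use decisions and estimates co-evolve.
for _ in range(3):
    policy = m_step(e_step(policy))
```

After a few iterations the policy learns to prefer code on problems where code execution succeeds more often and plain CoT elsewhere, mirroring the intended self-reinforcing cycle between tool-use decisions and capabilities.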
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K (test) | Accuracy | 89.26 | 751 |
| Mathematical Reasoning | MATH500 (test) | Accuracy | 71.4 | 381 |
| Mathematical Reasoning | AIME 2024 (test) | Accuracy | 22.6 | 103 |
| Mathematical Reasoning | GaoKao (test) | Accuracy | 51.69 | 61 |
| Mathematical Reasoning | AMC 2023 (test) | Accuracy | 45.18 | 27 |
| Mathematical Reasoning | Olympiad (test) | Accuracy | 32.6 | 14 |