AWPO: Enhancing Tool-Use of Large Language Models through Adaptive Integration of Reasoning Rewards
About
While Reinforcement Learning (RL) shows promise in training tool-use Large Language Models (LLMs) using verifiable outcome rewards, existing methods largely overlook the potential of reasoning rewards based on chain-of-thought quality for better tool utilization. Furthermore, naïvely combining reasoning and outcome rewards may yield suboptimal performance or conflict with the primary optimization objective. To address this, we propose Advantage-Weighted Policy Optimization (AWPO), a principled RL framework that adaptively integrates reasoning rewards into advantage estimation to improve tool-use performance. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals based on group-relative statistics, alongside a tailored clipping mechanism for stable optimization. Extensive experiments demonstrate that AWPO achieves state-of-the-art performance across standard tool-use benchmarks, significantly outperforming strong baselines and leading closed-source models in challenging multi-turn scenarios. Notably, with exceptional parameter efficiency, our 4B model surpasses Grok-4 by 16.0% in multi-turn accuracy while preserving generalization capability on the out-of-distribution MMLU-Pro benchmark.
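The exact AWPO formulas are not reproduced here, but a minimal sketch of the advantage-shaping idea, assuming GRPO-style group-relative baselines, might look like the following. All names and thresholds (`awpo_advantages`, `var_gate_tau`, `clip_c`, the difficulty rule) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def awpo_advantages(outcome_rewards, reasoning_rewards,
                    var_gate_tau=0.1, clip_c=2.0):
    """Illustrative sketch of AWPO-style advantage shaping.

    outcome_rewards, reasoning_rewards: arrays of shape (G,), one entry
    per rollout in a group sampled for the same prompt (GRPO-style).
    All thresholds and weights here are hypothetical placeholders.
    """
    out = np.asarray(outcome_rewards, dtype=float)
    rea = np.asarray(reasoning_rewards, dtype=float)

    # Group-relative advantages: standardize each reward within the group.
    adv_out = (out - out.mean()) / (out.std() + 1e-8)
    adv_rea = (rea - rea.mean()) / (rea.std() + 1e-8)

    # Variance-aware gating (one plausible reading): inject the reasoning
    # signal only when outcome rewards carry little information, i.e. when
    # nearly all rollouts in the group succeed or all fail.
    gate = 1.0 if out.std() < var_gate_tau else 0.0

    # Difficulty-aware weighting (assumes outcome rewards in [0, 1]):
    # weight the reasoning advantage more on hard prompts, where the
    # group's mean outcome reward is low.
    difficulty = 1.0 - out.mean()

    # Combine and clip the shaped advantage for stable optimization.
    adv = adv_out + gate * difficulty * adv_rea
    return np.clip(adv, -clip_c, clip_c)
```

In practice the shaped advantages would be plugged into a PPO/GRPO-style policy-gradient objective; the gating and weighting rules above are only one plausible reading of the abstract.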
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Tool Use | BFCL Multi-Turn | Accuracy | 52.12 | 24 |
| Tool Use | BFCL Single-Turn | Overall Accuracy (OA) | 84.11 | 10 |
| General Reasoning | MMLU-Pro OOD official (test) | Overall Accuracy | 73.43 | 6 |