ToolRLA: Multiplicative Reward Decomposition for Tool-Integrated Agents
About
Tool-integrated agents that interleave reasoning with API calls are promising for complex tasks, yet aligning them for high-stakes, domain-specific deployment remains challenging: existing reinforcement learning approaches rely on coarse binary rewards that cannot distinguish tool selection errors from malformed parameters. We present ToolRLA, a three-stage post-training pipeline (SFT -> GRPO -> DPO) for domain-specific tool agents. The core contribution is a fine-grained reward function with multiplicative correctness decomposition spanning four dimensions -- format validity, tool selection, parameter accuracy, and regulatory compliance -- that encodes domain priority orderings as inductive biases in the reward landscape. Deployed on a financial advisory copilot (80+ advisors, 1,200+ daily queries), ToolRLA achieves over three months: a 47% improvement in task completion rate (62%->91%), a 63% reduction in tool invocation errors (38%->14%), and a 93% reduction in regulatory violations (12%->0.8%), within sub-2-second latency. Ablation studies show the multiplicative reward design accounts for 7 percentage points of improvement over additive alternatives. Generalization is further validated on ToolBench and API-Bank.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Tool Use | API-Bank (test) | Accuracy71.8 | 16 | |
| Tool Use | ToolBench standard evaluation | Pass Rate51.3 | 6 | |
| Tool-use alignment | FA-Bench 500 queries | TCR91 | 6 | |
| Financial Advisory Copilot | Online 80+ investment advisors (production) | Advisor Manual Retry Rate9 | 2 |