ToolRLA: Multiplicative Reward Decomposition for Tool-Integrated Agents

About

Tool-integrated agents that interleave reasoning with API calls are promising for complex tasks, yet aligning them for high-stakes, domain-specific deployment remains challenging: existing reinforcement learning approaches rely on coarse binary rewards that cannot distinguish tool selection errors from malformed parameters. We present ToolRLA, a three-stage post-training pipeline (SFT -> GRPO -> DPO) for domain-specific tool agents. The core contribution is a fine-grained reward function with multiplicative correctness decomposition spanning four dimensions -- format validity, tool selection, parameter accuracy, and regulatory compliance -- that encodes domain priority orderings as inductive biases in the reward landscape. Over three months of deployment on a financial advisory copilot (80+ advisors, 1,200+ daily queries), ToolRLA achieves a 47% relative improvement in task completion rate (62% -> 91%), a 63% relative reduction in tool invocation errors (38% -> 14%), and a 93% relative reduction in regulatory violations (12% -> 0.8%), while keeping latency under 2 seconds. Ablation studies show that the multiplicative reward design accounts for 7 percentage points of improvement over additive alternatives. Generalization is further validated on ToolBench and API-Bank.
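To illustrate why a multiplicative decomposition behaves differently from an additive one, here is a minimal Python sketch. It is not the paper's reward function: the abstract does not give the formula, so the `StepScores` container, the [0, 1] score range, the plain product, and the unweighted mean used as the additive baseline are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepScores:
    """Hypothetical per-dimension scores in [0, 1] for one tool call."""
    format_valid: float
    tool_selection: float
    parameter_accuracy: float
    compliance: float


def multiplicative_reward(s: StepScores) -> float:
    """Product of the four scores: any zero factor zeroes the whole reward."""
    return s.format_valid * s.tool_selection * s.parameter_accuracy * s.compliance


def additive_reward(s: StepScores) -> float:
    """Unweighted mean of the four scores, used here as the additive baseline."""
    total = s.format_valid + s.tool_selection + s.parameter_accuracy + s.compliance
    return total / 4


# A well-formed, compliant call with perfect parameters, but for the wrong tool.
wrong_tool = StepScores(format_valid=1.0, tool_selection=0.0,
                        parameter_accuracy=1.0, compliance=1.0)

print(multiplicative_reward(wrong_tool))  # 0.0
print(additive_reward(wrong_tool))        # 0.75
```

Under these assumptions, the additive form still pays 0.75 for a call to the wrong tool, while the product pays nothing. That is one way a reward can prevent partial credit on lower-priority dimensions from masking a failure on a higher-priority one.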

Pengbo Liu • 2026

Related benchmarks

Task	Dataset	Result
Tool Use	API-Bank (test)	Accuracy 71.8	16
Tool Use	ToolBench standard evaluation	Pass Rate 51.3	6
Tool-use alignment	FA-Bench 500 queries	TCR 91	6
Financial Advisory Copilot	Online 80+ investment advisors (production)	Advisor Manual Retry Rate 9	2
