
Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

About

Agentic reinforcement learning can induce tool abuse, where models overuse external tools even for queries solvable by internal reasoning. Existing approaches mitigate this issue with uniform tool-use penalties or hard limits, which reduce tool frequency but may also suppress useful tool-assisted exploration. We propose EAPO, an Efficient Agentic Policy Optimization framework that learns selective tool use. EAPO introduces tool-free trajectories into each rollout group, applies difficulty-aware reward shaping to penalize redundant tool calls mainly on easier queries, and uses confidence-aware token reweighting to improve policy learning. Across nine mathematical and knowledge-intensive reasoning benchmarks, EAPO consistently improves the accuracy-efficiency trade-off on Qwen2.5-3B, Qwen2.5-7B, and Llama3.1-8B. Compared with GRPO, EAPO improves average performance by 10.45%, 7.27%, and 9.69%, while reducing average tool calls by 18.33%, 18.33%, and 24.59%, respectively. These results show that agents can learn when not to use tools without compromising tool-integrated reasoning.

Liuji Chen, Dianxing Tang, Xing Shi, Dingshuo Chen, Qiang Liu, Shu Wu, Liang Wang • 2026
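
To make the idea of difficulty-aware reward shaping concrete, here is a minimal sketch of how a tool-call penalty could be scaled by query easiness inside one rollout group. It is not the paper's formula: the `Trajectory` type, the `penalty` coefficient, and the choice to estimate easiness from the success rate of the tool-free trajectories are all assumptions made for illustration. The only standard piece is the group-normalized advantage used by GRPO-style methods. Confidence-aware token reweighting is not sketched, since the abstract gives no detail on it.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Trajectory:
    correct: bool    # whether the final answer was right
    tool_calls: int  # number of external tool invocations in the rollout


def shaped_rewards(group, penalty=0.1):
    """Hypothetical difficulty-aware shaping for one rollout group.

    Assumption: easiness is the success rate of the tool-free trajectories
    in the group. The per-call penalty is multiplied by easiness, so it
    vanishes when the query cannot be solved without tools.
    """
    tool_free = [t for t in group if t.tool_calls == 0]
    if tool_free:
        easiness = mean(1.0 if t.correct else 0.0 for t in tool_free)
    else:
        easiness = 0.0  # no tool-free evidence: apply no penalty
    return [
        (1.0 if t.correct else 0.0) - penalty * easiness * t.tool_calls
        for t in group
    ]


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Easy query: the tool-free rollouts succeed, so tool calls are penalized.
easy = [Trajectory(True, 0), Trajectory(True, 0), Trajectory(True, 2)]
# Hard query: the tool-free rollouts fail, so tool calls are not penalized.
hard = [Trajectory(False, 0), Trajectory(False, 0), Trajectory(True, 2)]

easy_rewards = shaped_rewards(easy)
hard_rewards = shaped_rewards(hard)
easy_adv = group_advantages(easy_rewards)
```

Under these assumptions, the tool-using trajectory on the easy query receives a lower reward than its tool-free peers and therefore a negative advantage, while on the hard query the same number of tool calls costs nothing.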

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Knowledge-intensive reasoning | MuSiQue | F1 Score | 33.1 | 43 |
| Knowledge-intensive reasoning | HotpotQA | F1 Score | 0.624 | 41 |
| Knowledge-intensive reasoning | Bamboogle | F1 | 61.7 | 23 |
| Knowledge-intensive reasoning | 2WikiMultihopQA | F1 Score | 58.6 | 15 |
| Knowledge-intensive reasoning | Bamboogle | F1 Score | 60.4 | 15 |
| Mathematical Reasoning | MATH500 | Pass@1 | 78.8 | 15 |
| Mathematical Reasoning | GSM8K | Pass@1 | 92 | 15 |
| Mathematical Reasoning | MATH | Pass@1 Accuracy | 90 | 15 |
| Mathematical Reasoning | AIME24, AIME25, MATH500, GSM8K, MATH Aggregated | Pass@1 | 63.5 | 15 |
| Mathematical Reasoning | AIME 2024 | Pass@1 Accuracy | 30 | 15 |
Showing 10 of 17 rows
