
ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

About

While LLM-based agents can interact with environments via invoking external tools, their expanded capabilities also amplify security risks. Monitoring step-level tool invocation behaviors in real time and proactively intervening before unsafe execution is critical for agent deployment, yet remains under-explored. In this work, we first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS-Guard, using multi-task reinforcement learning. The model proactively detects unsafe tool invocation actions before execution by reasoning over the interaction history. It assesses request harmfulness and action-attack correlations, producing interpretable and generalizable safety judgments and feedback. Furthermore, we introduce TS-Flow, a guardrail-feedback-driven reasoning framework for LLM agents, which reduces harmful tool invocations of ReAct-style agents by 65 percent on average and improves benign task completion by approximately 10 percent under prompt injection attacks.
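The abstract describes a guardrail that inspects each proposed tool invocation before execution and feeds its judgment back to the agent. A minimal sketch of that loop is below; all names (`toy_guardrail`, `guarded_step`, `Verdict`) are hypothetical placeholders standing in for TS-Guard and TS-Flow, not the paper's actual API.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    safe: bool
    feedback: str  # interpretable safety judgment returned to the agent

def toy_guardrail(history, action):
    """Stand-in for a step-level guardrail: flags one hard-coded unsafe pattern."""
    if "rm -rf" in action:
        return Verdict(False, "Action correlates with a destructive command; refuse.")
    return Verdict(True, "No attack correlation detected.")

def guarded_step(history, proposed_action, guardrail=toy_guardrail):
    """Execute one agent step only if the guardrail approves it beforehand."""
    verdict = guardrail(history, proposed_action)
    if not verdict.safe:
        # Unsafe: block execution and append the feedback to the interaction
        # history so the agent can re-plan (the feedback-driven idea of TS-Flow).
        history.append(("guardrail", verdict.feedback))
        return history, None
    history.append(("action", proposed_action))
    return history, f"executed: {proposed_action}"

history = []
history, result = guarded_step(history, "rm -rf /tmp/data")   # blocked pre-execution
history, result = guarded_step(history, "search('weather')")  # allowed
```

The point of the sketch is the ordering: the guardrail runs on the interaction history plus the *proposed* action, so unsafe invocations are intercepted before any side effect occurs rather than detected after the fact.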

Yutao Mou, Zhangchi Xue, Lijun Li, Peiyang Liu, Shikun Zhang, Wei Ye, Jing Shao • 2026

Related benchmarks

Task                                         Dataset                          Metric                 Result   Rank
Step-level tool invocation safety detection  AgentHarm-Traj                   Accuracy               84.81    20
Step-level tool invocation safety detection  ASB-Traj                         Accuracy               0.9497   20
Step-level tool invocation safety detection  AgentDojo-Traj                   Accuracy               91.72    20
Guarded Agent Evaluation                     ASB latest (IPI)                 ASR                    5.5      14
Guarded Agent Evaluation                     AgentHarm latest (full)          Refusal Rate           97.16    14
Guarded Agent Evaluation                     AgentDojo full latest            ASR                    0.0768   14
Guarded Agent Evaluation                     ASB latest (DPI)                 ASR                    62.25    14
Safety Detection                             TS-Bench AgentHarm-Traj (eval)   Latency (s/sample)     1.36     4
Safety Detection                             TS-Bench AgentDojo-Traj (eval)   Efficiency (s/sample)  1.36     4
