GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning

About

The rapid advancement of large language model (LLM) agents has raised new concerns regarding their safety and security. In this paper, we propose GuardAgent, the first guardrail agent to protect target agents by dynamically checking whether their actions satisfy given safety guard requests. Specifically, GuardAgent first analyzes the safety guard requests to generate a task plan, and then maps this plan into guardrail code for execution. By performing the code execution, GuardAgent can deterministically follow the safety guard request and safeguard target agents. In both steps, an LLM is utilized as the reasoning component, supplemented by in-context demonstrations retrieved from a memory module storing experiences from previous tasks. In addition, we propose two novel benchmarks: EICU-AC benchmark to assess the access control for healthcare agents and Mind2Web-SC benchmark to evaluate the safety policies for web agents. We show that GuardAgent effectively moderates the violation actions for different types of agents on these two benchmarks with over 98% and 83% guardrail accuracies, respectively. Project page: https://guardagent.github.io/
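The two-step flow described in the abstract (guard request to task plan, then plan to executable guardrail code) can be illustrated with a minimal sketch. This is not the authors' implementation. The LLM planning step is replaced by a fixed stub, memory retrieval is omitted, and the policy table, role names, and function names (`ACCESS_POLICY`, `plan_task`, `check_action`) are hypothetical, chosen only to resemble an access-control setting. The point it shows is that once the check is expressed as code, the allow/deny decision is deterministic.

```python
from dataclasses import dataclass

# Hypothetical access-control policy: role -> databases that role may read.
# Illustrative only; these entries are not taken from the EICU-AC benchmark.
ACCESS_POLICY = {
    "physician": {"vitals", "lab", "diagnosis"},
    "nursing": {"vitals"},
}


@dataclass
class GuardResult:
    allowed: bool  # True when every requested database is permitted
    denied: set    # databases the role is not permitted to access


def plan_task(guard_request: str) -> list[str]:
    """Stand-in for the LLM planning step (assumption: a real system
    would prompt an LLM with the guard request and retrieved demos)."""
    return [
        "identify the user's role",
        "list the databases the target agent's action touches",
        "deny the action if any database is outside the role's permissions",
    ]


def check_action(role: str, requested: set) -> GuardResult:
    """Stand-in for the generated guardrail code: a deterministic check
    of the target agent's requested databases against the policy."""
    permitted = ACCESS_POLICY.get(role, set())  # unknown roles get nothing
    denied = set(requested) - permitted
    return GuardResult(allowed=not denied, denied=denied)


if __name__ == "__main__":
    for step in plan_task("Enforce role-based access control"):
        print("plan:", step)
    print(check_action("nursing", {"vitals", "lab"}))
```

In this sketch, a request from the hypothetical `nursing` role that touches `lab` is denied, and the result names the offending database so the denial can be explained to the user.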

Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, Han Xie, Jiawei Zhang, Zidi Xiong, Chulin Xie, Carl Yang, Dawn Song, Bo Li • 2024

Related benchmarks

Task	Dataset	Result
Agent Behavioral Safety and Helpfulness Evaluation	ToolEmu	Safety Rate 96.2	42
Malicious behavior measurement	AgentHarm Harmful	Harm Rate 2.8	33
LLM Agent Utility	AgentHarm Benign Requests	Utility Score 35.1	23
Agent behavioral safety	AgentHarm	Safety Rate 90.4	14
Agent behavioral safety	InjecAgent	Safety Rate 94.3	14
Agent behavioral safety	AgentDojo	Safety Rate 89.5	14
Safety and Utility Evaluation	FINVAULT	Approve Rate 8.4	12
Safety Compliance Evaluation	Mind2Web SC	LPA 85.7	10
Safety Compliance Evaluation	eICU-AC	LPA 90	10
Agent Defense	S2Bench	Query ASR 0.5	10

Showing 10 of 15 rows
