AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection

About

The rapid advancements in Large Language Models (LLMs) have enabled their deployment as autonomous agents for handling complex tasks in dynamic environments. These LLMs demonstrate strong problem-solving capabilities and adaptability to multifaceted scenarios. However, their use as agents also introduces significant risks, including task-specific risks, which are identified by the agent administrator based on the specific task requirements and constraints, and systemic risks, which stem from vulnerabilities in their design or interactions, potentially compromising the confidentiality, integrity, or availability (CIA) of information and triggering security risks. Existing defenses fail to mitigate these risks adaptively and effectively. In this paper, we propose AGrail, a lifelong agent guardrail to enhance LLM agent safety, which features adaptive safety check generation, effective safety check optimization, and tool compatibility and flexibility. Extensive experiments demonstrate that AGrail not only achieves strong performance against task-specific and systemic risks but also exhibits transferability across different LLM agents' tasks.

Weidi Luo, Shenghong Dai, Xiaogeng Liu, Suman Banerjee, Huan Sun, Muhao Chen, Chaowei Xiao • 2025

Related benchmarks

Task	Dataset	Result
Hijacking	ACIARENA Code	UA 44.17	27
Disruption	ACIARENA Code	UA 20.22	27
Exfiltration	ACIARENA Code	UA 37.78	27
Benign Utility	ACIARENA Code 1.0 (test)	BU 54.44	21
Lifelong Safety Adaptation	Safety Benchmarks (Final-day)	Final-day Macro F1 Score 85.7	15
Safety Compliance Evaluation	eICU-AC	LPA 98.4	10
Safety Compliance Evaluation	Mind2Web SC	LPA 94	10
Agent Defense	S2Bench	Query ASR 0.377	10
Safety Detection	MAS Qwen-2.5-7B (Current)	Accuracy 85.17	10
Safety Detection	MAS LLaMA-3.1-8B (Current)	Accuracy 82.09	10

Showing 10 of 19 rows
