
LlamaFirewall: An open source guardrail system for building secure AI agents

About

Large language models (LLMs) have evolved from simple chatbots into autonomous agents capable of performing complex tasks such as editing production code, orchestrating workflows, and taking higher-stakes actions based on untrusted inputs like webpages and emails. These capabilities introduce new security risks that existing measures, such as model fine-tuning or chatbot-focused guardrails, do not fully address. Given the higher stakes and the absence of deterministic solutions to mitigate these risks, there is a critical need for a real-time guardrail monitor that serves as a final layer of defense and supports system-level, use-case-specific safety-policy definition and enforcement. We introduce LlamaFirewall, an open-source, security-focused guardrail framework designed to serve as that final layer of defense against security risks associated with AI agents. Our framework mitigates risks such as prompt injection, agent misalignment, and insecure code through three guardrails: PromptGuard 2, a universal jailbreak detector that demonstrates clear state-of-the-art performance; Agent Alignment Checks, a chain-of-thought auditor that inspects agent reasoning for prompt injection and goal misalignment and which, while still experimental, shows stronger efficacy at preventing indirect injections in general scenarios than previously proposed approaches; and CodeShield, a fast, extensible online static-analysis engine aimed at preventing coding agents from generating insecure or dangerous code. Additionally, we include easy-to-use customizable scanners that make it possible for any developer who can write a regular expression or an LLM prompt to quickly update an agent's security guardrails.
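To make the "customizable scanners" idea concrete, here is a minimal sketch of a regex-based scanner of the kind the abstract describes. The class and method names (`RegexScanner`, `ScanResult`, `scan`) are illustrative assumptions for this sketch, not LlamaFirewall's actual API; consult the project's repository for the real interfaces.

```python
import re
from dataclasses import dataclass

@dataclass
class ScanResult:
    """Outcome of scanning one piece of agent input or output."""
    blocked: bool
    reason: str = ""

class RegexScanner:
    """Hypothetical custom guardrail: block text matching any deny-list pattern."""

    def __init__(self, patterns: list[str]):
        self.patterns = [re.compile(p, re.IGNORECASE) for p in patterns]

    def scan(self, text: str) -> ScanResult:
        for pat in self.patterns:
            if pat.search(text):
                return ScanResult(blocked=True, reason=f"matched {pat.pattern!r}")
        return ScanResult(blocked=False)

# Example deny-list: a common prompt-injection phrase and a destructive shell command.
scanner = RegexScanner([
    r"ignore (all )?previous instructions",
    r"rm\s+-rf\s+/",
])
```

A developer updates the guardrail simply by editing the pattern list, which is the low-friction extensibility the abstract highlights; in practice such a scanner would sit alongside model-based checks like PromptGuard 2 rather than replace them.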

Sahana Chennabasappa, Cyrus Nikolaidis, Daniel Song, David Molnar, Stephanie Ding, Shengye Wan, Spencer Whitman, Lauren Deason, Nicholas Doucette, Abraham Montilla, Alekhya Gampa, Beto de Paola, Dominik Gabi, James Crnkovich, Jean-Christophe Testud, Kat He, Rashnil Chaturvedi, Wu Zhou, Joshua Saxe • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Agent Safety Evaluation | Agent-SafetyBench, aggregated clean and five attack types | UBR | 39.73 | 30 |
| Indirect Prompt Injection Defense Evaluation | AgentDojo, TOOLKNOWLEDGE attack suite | Latency (s) | 10.08 | 24 |
| Adversarial Robustness against Indirect Prompt Injection | AgentDojo, Average across attacks | UA | 34.58 | 22 |
| Adversarial Robustness against Indirect Prompt Injection | AgentDojo, Combined | UA | 44.56 | 22 |
| Adversarial Robustness against Indirect Prompt Injection | AgentDojo, ImportantMsgs | Utility (UA) | 39.57 | 22 |
| Adversarial Robustness against Indirect Prompt Injection | AgentDojo, ToolKnowledge | Utility Score | 39.23 | 22 |
| Adversarial Robustness against Indirect Prompt Injection | AgentDojo, IgnorePrevious | Utility (UA) | 43.88 | 22 |
| LLM Agent Task Completion | AgentDojo, No Attack | Benign Utility | 45.65 | 22 |
| Coding CFH (reverse shell) attack | CFH Hard Coding | Generation Success Rate | 90 | 8 |
| Multi-turn Safety Risk Assessment | Filesystem | ASR | 92 | 8 |

Showing 10 of 22 rows.
