
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

About

AI agents aim to solve complex tasks by combining text-based reasoning with external tool calls. Unfortunately, AI agents are vulnerable to prompt injection attacks where data returned by external tools hijacks the agent to execute malicious tasks. To measure the adversarial robustness of AI agents, we introduce AgentDojo, an evaluation framework for agents that execute tools over untrusted data. To capture the evolving nature of attacks and defenses, AgentDojo is not a static test suite, but rather an extensible environment for designing and evaluating new agent tasks, defenses, and adaptive attacks. We populate the environment with 97 realistic tasks (e.g., managing an email client, navigating an e-banking website, or making travel bookings), 629 security test cases, and various attack and defense paradigms from the literature. We find that AgentDojo poses a challenge for both attacks and defenses: state-of-the-art LLMs fail at many tasks (even in the absence of attacks), and existing prompt injection attacks break some security properties but not all. We hope that AgentDojo can foster research on new design principles for AI agents that solve common tasks in a reliable and robust manner. We release the code for AgentDojo at https://github.com/ethz-spylab/agentdojo.

Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, Florian Tramèr • 2024

Related benchmarks

Task | Dataset | Result | Rank
Utility assessment | AgentDojo (test) | Utility 100 | 128
Adversarial Attack Success Rate Assessment | AgentDojo | ASR 5.84 | 56
Agent Task Performance | AgentDojo Travel | Attack Success Rate 5.71 | 24
Indirect Prompt Injection Defense | TrojanTools | ASR 28.9 | 18
Agent Task Performance | AgentDojo Banking | Attack Success Rate 11.11 | 18
Indirect Prompt Injection Defense | Ignore Instruction | ASR 10.3 | 18
Indirect Prompt Injection Defense | Combined Attacks | ASR 13.2 | 18
Agent Planning | AgentDojo | TCR @ ∞ 72.3 | 16
Tool-use agent security evaluation | SIREN | Explicit Directive (UA) 4.02 | 16
Indirect Prompt Injection Defense | Vision-Language Agentic IPI Benchmark (test) | BU 72 | 12
Showing 10 of 24 rows
