DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks
About
LLM-integrated applications and agents are vulnerable to prompt injection attacks, where an attacker injects prompts into their inputs to induce attacker-desired outputs. A detection method aims to determine whether a given input is contaminated by an injected prompt. However, existing detection methods have limited effectiveness against state-of-the-art attacks, let alone adaptive ones. In this work, we propose DataSentinel, a game-theoretic method to detect prompt injection attacks. Specifically, DataSentinel fine-tunes an LLM to detect inputs contaminated with injected prompts that are strategically adapted to evade detection. We formulate this as a minimax optimization problem, with the objective of fine-tuning the LLM to detect strong adaptive attacks. Furthermore, we propose a gradient-based method to solve the minimax optimization problem by alternating between the inner max and outer min problems. Our evaluation results on multiple benchmark datasets and LLMs show that DataSentinel effectively detects both existing and adaptive prompt injection attacks.
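The alternating scheme mentioned in the abstract can be illustrated with a toy problem. The sketch below is not DataSentinel's objective or training code. It assumes a hypothetical scalar saddle function f(theta, delta) = 0.5*theta^2 + theta*delta - 0.5*delta^2, where `theta` stands in for the detector's parameters (outer min) and `delta` for the adaptive attack (inner max). The function name, step size, and iteration counts are illustrative choices only.

```python
def alternating_minimax(theta, delta, lr=0.1, inner_steps=5, rounds=200):
    """Toy alternating gradient descent-ascent on
    f(theta, delta) = 0.5*theta**2 + theta*delta - 0.5*delta**2.

    Hypothetical stand-in: theta plays the defender (minimizes f),
    delta plays the attacker (maximizes f).
    """
    for _ in range(rounds):
        # Inner max: a few gradient-ascent steps on delta with theta fixed.
        for _ in range(inner_steps):
            grad_delta = theta - delta      # df/d(delta)
            delta += lr * grad_delta
        # Outer min: one gradient-descent step on theta with delta fixed.
        grad_theta = theta + delta          # df/d(theta)
        theta -= lr * grad_theta
    return theta, delta


theta, delta = alternating_minimax(theta=2.0, delta=-1.0)
print(theta, delta)
```

For this toy function the saddle point is at theta = 0, delta = 0, and the iterates contract toward it. In the setting the abstract describes, the scalar updates would be replaced by gradient steps on the fine-tuned LLM and on the injected prompt, respectively.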
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Question Answering | SQuAD v2 | ASR Score | 0.78 | 36 |
| Question Answering | Dolly Closed QA | ASR | 84 | 36 |
| Indirect Prompt Injection Defense Evaluation | AgentDojo TOOLKNOWLEDGE attack suite | Latency (s) | 17.44 | 24 |
| Adversarial Robustness against Indirect Prompt Injection | AgentDojo Average across attacks | UA | 32.06 | 22 |
| LLM Agent Task Completion | AgentDojo No Attack | Benign Utility | 53.26 | 22 |
| Adversarial Robustness against Indirect Prompt Injection | AgentDojo ImportantMsgs | Utility (UA) | 33.79 | 22 |
| Adversarial Robustness against Indirect Prompt Injection | AgentDojo ToolKnowledge | Utility Score | 37.41 | 22 |
| Adversarial Robustness against Indirect Prompt Injection | AgentDojo IgnorePrevious | Utility (UA) | 42.52 | 22 |
| Adversarial Robustness against Indirect Prompt Injection | AgentDojo Combined | UA | 35.03 | 22 |
| Prompt Injection Defense | WASP | Attack Success Rate (ASR) | 10 | 16 |