Formalizing and Benchmarking Prompt Injection Attacks and Defenses
About
A prompt injection attack aims to inject malicious instruction/data into the input of an LLM-Integrated Application such that it produces results as an attacker desires. Existing works are limited to case studies. As a result, the literature lacks a systematic understanding of prompt injection attacks and their defenses. We aim to bridge the gap in this work. In particular, we propose a framework to formalize prompt injection attacks. Existing attacks are special cases in our framework. Moreover, based on our framework, we design a new attack by combining existing ones. Using our framework, we conduct a systematic evaluation on 5 prompt injection attacks and 10 defenses with 10 LLMs and 7 tasks. Our work provides a common benchmark for quantitatively evaluating future prompt injection attacks and defenses. To facilitate research on this topic, we make our platform public at https://github.com/liu00222/Open-Prompt-Injection.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Indirect Prompt Injection Attack | IPI-3k | ASR37.4 | 90 | |
| Indirect Prompt Injection | Amazon Reviews | ASR16.4 | 47 | |
| Indirect Prompt Injection | HotpotQA | ASR84.4 | 42 | |
| Indirect Prompt Injection | Multi-News | ASR87.5 | 42 | |
| Toxic Comment Detection | Toxic Comment | ASR18.1 | 14 | |
| Negative Review Detection | Negative Review | ASR11 | 14 | |
| Spam Email Detection | Spam Email | ASR37.9 | 14 | |
| Prompt Injection Attack | NavGPT (test) | Navigation Error7.51 | 12 | |
| Negative Review Classification | Negative Review | Tokens Used12.1 | 10 | |
| Prompt Injection | Negative Review | ASR (None Defense)0.3 | 10 |