Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models

About

The integration of large language models with external content has enabled applications such as Microsoft Copilot but also introduced vulnerabilities to indirect prompt injection attacks. In these attacks, malicious instructions embedded within external content can manipulate LLM outputs, causing deviations from user expectations. To address this critical yet under-explored issue, we introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities. Using BIPIA, we evaluate existing LLMs and find them universally vulnerable. Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content. Based on these findings, we propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings. Extensive experiments demonstrate that our black-box defense provides substantial mitigation, while our white-box defense reduces the attack success rate to near-zero levels, all while preserving the output quality of LLMs. We hope this work inspires further research into securing LLM applications and fostering their safe and reliable use.

Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, Fangzhao Wu• 2023

Related benchmarks

TaskDatasetResultRank
Prompt Injection DefenseInj-SQuAD
Combined ASR34.78
123
Prompt Injection PreventionAlpaca-Farm
ASR24.52
105
Defense against Indirect Prompt InjectionFiltered QA dataset
ASR (Naive)97.65
30
Direct Prompt InjectionAlpacaFarm (208 samples)
Naive Success Rate78.36
30
Prompt Injection AttackDirect Scenario
ASR27.88
28
Question AnsweringTriviaQA
Accuracy78.89
26
Indirect Prompt Injection DefenseInj-TriviaQA
Naive ASR23.11
21
Prompt InjectionSQuAD Inj
ASR (Naive)19.33
18
Prompt Injection DefensePrompt Injection Attacks (test)
Naive ASR11.05
16
Defending against gradient-based attacksLlama3 GCG Attack (test)
ASR24.51
10
Showing 10 of 12 rows

Other info

Follow for update