Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction

About

Large language models (LLMs) have demonstrated impressive performance and have come to dominate the field of natural language processing (NLP) across various tasks. However, due to their strong instruction-following capabilities and inability to distinguish between instructions and data content, LLMs are vulnerable to prompt injection attacks. These attacks manipulate LLMs into deviating from the original input instructions and executing maliciously injected instructions within data content, such as web documents retrieved from search engines. Existing defense methods, including prompt-engineering and fine-tuning approaches, typically instruct models to follow the original input instructions while suppressing their tendencies to execute injected instructions. However, our experiments reveal that suppressing instruction-following tendencies is challenging. Through analyzing failure cases, we observe that although LLMs tend to respond to any recognized instructions, they are aware of which specific instructions they are executing and can correctly reference them within the original prompt. Motivated by these findings, we propose a novel defense method that leverages, rather than suppresses, the instruction-following abilities of LLMs. Our approach prompts LLMs to generate responses that include both answers and their corresponding instruction references. Based on these references, we filter out answers not associated with the original input instructions. Comprehensive experiments demonstrate that our method outperforms prompt-engineering baselines and achieves performance comparable to fine-tuning methods, reducing the attack success rate (ASR) to 0 percent in some scenarios. Moreover, our approach has minimal impact on overall utility.

Yulin Chen, Haoran Li, Yuan Sui, Yue Liu, Yufei He, Xiaoling Bai, Chi Fei, Yabo Li, Haozhe Ma, Yangqiu Song, Bryan Hooi• 2025

Related benchmarks

TaskDatasetResultRank
Prompt Injection DefenseInj-SQuAD
Combined ASR0.11
123
Prompt Injection PreventionAlpaca-Farm
ASR0.00e+0
105
Prompt Injection AttackDirect Scenario
ASR3.85
28
Question AnsweringTriviaQA
Accuracy79.44
26
Indirect Prompt Injection DefenseInj-TriviaQA
Naive ASR0.11
21
Prompt InjectionSQuAD Inj
ASR (Naive)1.11
18
Showing 6 of 6 rows

Other info

Follow for update