SecAlign: Defending Against Prompt Injection with Preference Optimization

About

Large language models (LLMs) are becoming increasingly prevalent in modern software systems, interfacing between the user and the Internet to assist with tasks that require advanced language understanding. To accomplish these tasks, the LLM often uses external data sources such as user documents, web retrieval, results from API calls, etc. This opens up new avenues for attackers to manipulate the LLM via prompt injection. Adversarial prompts can be injected into external data sources to override the system's intended instruction and instead execute a malicious instruction. To mitigate this vulnerability, we propose a new defense called SecAlign based on the technique of preference optimization. Our defense first constructs a preference dataset with prompt-injected inputs, secure outputs (ones that respond to the legitimate instruction), and insecure outputs (ones that respond to the injection). We then perform preference optimization on this dataset to teach the LLM to prefer the secure output over the insecure one. This provides the first known method that reduces the success rates of various prompt injections to <10%, even against attacks much more sophisticated than ones seen during training. This indicates our defense generalizes well against unknown and yet-to-come attacks. Also, SecAlign models are still practical with similar utility to the one before defensive training in our evaluations. Our code is at https://github.com/facebookresearch/SecAlign

Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, Chuan Guo• 2024

Related benchmarks

Task	Dataset	Result
Prompt Injection Prevention	Alpaca-Farm	--	105
Prompt Injection Prevention	NQ simplified	Naïve Success Rate3	24
Agentic Security and Utility Evaluation	AgentDojo	ASR2	22
Dynamic Agent Security and Utility Evaluation	AgentDyn	ASR9	22
Poisoning Defense Evaluation	Target-Injected Tasks 7x7 UCC poisoned	Delta Attack Effectiveness (%)0.69	10
Indirect Prompt Injection Defense	Image Modality (test)	UIAinject41.1	10
Indirect Prompt Injection Defense	Video Modality (test)	UIAinject35.9	10
Indirect Prompt Injection Defense	Audio Modality (test)	UIAinject55.8	9
Prompt Injection Defense	Qwen2.5-VL-7B Video Evaluation Set	UIAinject44.3	7
Prompt Injection Defense	InternVL Image Evaluation Set 3.5-8B	UIAinject57.8	7

Showing 10 of 19 rows

Other info

Follow for update

@wizwand_team Discord