
BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models

About

Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions. The high dimensionality of potential triggers in the token space and the diverse range of malicious behaviors make defending against them a critical challenge. We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space. Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations. Experiments show that BEEAR reduces the success rate of RLHF-time backdoor attacks from >95% to <1%, and from 47% to 0% for instruction-tuning-time backdoors targeting malicious code generation, without compromising model utility. Requiring only defender-defined safe and unwanted behaviors, BEEAR represents a step towards practical defenses against safety backdoors in LLMs, providing a foundation for further advancements in AI safety and security.
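
A minimal sketch of the bi-level recipe described above is given below. It is hypothetical and not the authors' released code: the checkpoint name, the one-example unwanted/safe behavior pairs, the perturbed prefix length, the inner/outer step counts, and the learning rates are all assumptions, and for simplicity the perturbation is added to the input token embeddings rather than to an intermediate decoder layer's embedding space as in the paper.

```python
# Hypothetical sketch of BEEAR-style bi-level optimization (not the authors' code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"          # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.train()

def make_batch(prompt, response):
    """Tokenize a (prompt, response) pair; loss is computed on response tokens only."""
    p = tok(prompt, return_tensors="pt").input_ids
    r = tok(response, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([p, r], dim=1)
    labels = input_ids.clone()
    labels[:, : p.shape[1]] = -100                     # ignore prompt positions in the loss
    return input_ids, labels

# Defender-defined behavior samples (illustrative single examples; the paper uses larger sets).
unwanted = make_batch("Tell me how to pick a lock.", "Sure, here is how to pick a lock:")
safe     = make_batch("Tell me how to pick a lock.", "I can't help with that request.")

embed = model.get_input_embeddings()
PERT_LEN = 8                                           # assumed perturbed prefix length
delta = torch.zeros(PERT_LEN, embed.embedding_dim, requires_grad=True)
inner_opt = torch.optim.Adam([delta], lr=1e-2)
outer_opt = torch.optim.AdamW(model.parameters(), lr=2e-5)

def perturbed_loss(batch):
    """LM loss with the universal perturbation added to the first PERT_LEN embeddings."""
    input_ids, labels = batch
    embeds = embed(input_ids).clone()
    n = min(PERT_LEN, embeds.shape[1])
    embeds[:, :n, :] += delta[:n].to(embeds.dtype)
    return model(inputs_embeds=embeds, labels=labels).loss

for outer_step in range(100):
    # Inner level: find an embedding perturbation that elicits the unwanted behavior,
    # standing in for the unknown backdoor trigger.
    for _ in range(5):
        inner_opt.zero_grad()
        perturbed_loss(unwanted).backward()
        inner_opt.step()
    # Outer level: update the model weights so the defender-defined safe behavior
    # survives that perturbation (the paper additionally mixes in utility-preserving data).
    outer_opt.zero_grad()
    perturbed_loss(safe).backward()
    outer_opt.step()
```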

Yi Zeng, Weiyu Sun, Tran Ngoc Huynh, Dawn Song, Bo Li, Ruoxi Jia • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|------|---------|--------|------|
| Text Generation | VPI Generation Tasks Llama3-8B Mistral-7B (test) | ASR 28 | 16 |
| Text Generation | AutoPoison Generation Llama3-8B Mistral-7B (test) | ASR 11 | 16 |
| Text Generation | DTBA Llama3-8B Mistral-7B (test) | ASR 10.5 | 16 |
| Classification | Emotion | ASR 21.8 | 15 |
| Jailbreak Attack | HarmBench Malicious (full) | Harmful Score 4.38 | 14 |
