Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM

About

Recently, Large Language Models (LLMs) have made significant advancements and are now widely used across various domains. Unfortunately, there has been a rising concern that LLMs can be misused to generate harmful or malicious content. Though a line of research has focused on aligning LLMs with human values and preventing them from producing inappropriate content, such alignments are usually vulnerable and can be bypassed by alignment-breaking attacks via adversarially optimized or handcrafted jailbreaking prompts. In this work, we introduce a Robustly Aligned LLM (RA-LLM) to defend against potential alignment-breaking attacks. RA-LLM can be directly constructed upon an existing aligned LLM with a robust alignment checking function, without requiring any expensive retraining or fine-tuning process of the original LLM. Furthermore, we also provide a theoretical analysis for RA-LLM to verify its effectiveness in defending against alignment-breaking attacks. Through real-world experiments on open-source large language models, we demonstrate that RA-LLM can successfully defend against both state-of-the-art adversarial prompts and popular handcrafted jailbreaking prompts by reducing their attack success rates from nearly 100% to around 10% or less.

Bochuan Cao, Yuanpu Cao, Lu Lin, Jinghui Chen• 2023

Related benchmarks

Task	Dataset	Result
Instruction Following	AlpacaEval	Win Rate46.8	423
Jailbreak Defense	AdvBench	ASR (PAIR)1.7	115
Jailbreak Defense	HarmBench	PAIR ASR2.51	91
Jailbreak Attack	AdvBench 150 Harmful Behaviors	ASR0.00e+0	45
Jailbreak Defense Performance	Jailbreak Attack Dataset	DSR78.6	33
Defense against adaptive attacks	HarmBench	ASR5.6	28
Adversarial Attack Defense	GCG Individual	BAR99.3	18
Question Answering	MS MARCO randomly selected 150 data points	BAR99	14
Jailbreak Defense Efficiency	HarmBench	Additional Tokens1	8
Jailbreak Defense	AdvBench Qwen-7B-chat	Attack Success Rate (ASR)15.8	7

Showing 10 of 16 rows

Other info

Follow for update

@wizwand_team Discord