Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing

About

Aligned large language models (LLMs) are vulnerable to jailbreaking attacks, which bypass the safeguards of targeted LLMs and fool them into generating objectionable content. While initial defenses show promise against token-based threat models, no existing defense both provides robustness against semantic attacks and avoids unfavorable trade-offs between robustness and nominal performance. To meet this need, we propose SEMANTICSMOOTH, a smoothing-based defense that aggregates the predictions of multiple semantically transformed copies of a given input prompt. Experimental results demonstrate that SEMANTICSMOOTH achieves state-of-the-art robustness against GCG, PAIR, and AutoDAN attacks while maintaining strong nominal performance on instruction-following benchmarks such as InstructionFollowing and AlpacaEval. The code will be publicly available at https://github.com/UCSB-NLP-Chang/SemanticSmooth.

Jiabao Ji, Bairu Hou, Alexander Robey, George J. Pappas, Hamed Hassani, Yang Zhang, Eric Wong, Shiyu Chang • 2024
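
The aggregation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `semantic_smooth`, the majority-vote rule, the toy model `toy_llm`, and the three string transformations are all assumptions made for illustration. A real defense would use semantic transformations and an actual LLM; the abstract does not specify either.

```python
import random
from typing import Callable, Optional, Sequence

REFUSAL = "I'm sorry, but I can't help with that."


def semantic_smooth(
    prompt: str,
    transforms: Sequence[Callable[[str], str]],
    llm: Callable[[str], str],
    is_refusal: Callable[[str], bool],
    rng: Optional[random.Random] = None,
) -> str:
    """Query the model on transformed copies of the prompt and aggregate.

    Assumed aggregation rule: take a majority vote on "refuse vs. comply",
    then return one response that agrees with the majority.
    """
    if not transforms:
        raise ValueError("at least one transformation is required")
    rng = rng or random.Random(0)

    # One model response per transformed copy of the input prompt.
    responses = [llm(transform(prompt)) for transform in transforms]
    votes = [is_refusal(response) for response in responses]

    # Strict majority is needed to refuse; ties fall back to complying.
    majority_refuses = 2 * sum(votes) > len(votes)
    agreeing = [r for r, v in zip(responses, votes) if v == majority_refuses]
    return rng.choice(agreeing)


# --- Hypothetical stand-ins, for demonstration only ---

def toy_llm(prompt: str) -> str:
    """Toy model: refuses 'forbidden' requests unless the exact suffix '!!XX' is present."""
    if "forbidden" in prompt and not prompt.endswith("!!XX"):
        return REFUSAL
    return f"Sure: {prompt}"


def identity(prompt: str) -> str:
    return prompt


def lowercase(prompt: str) -> str:
    return prompt.lower()


def drop_last_word(prompt: str) -> str:
    return " ".join(prompt.split()[:-1])


def is_refusal(response: str) -> bool:
    return response == REFUSAL


TRANSFORMS = [identity, lowercase, drop_last_word]
```

In this toy setup, the prompt `"explain the forbidden thing !!XX"` jailbreaks `toy_llm` when sent directly, but two of the three transformed copies break the suffix, so the majority vote returns the refusal. A prompt without the trigger word is answered normally by all three copies.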

Related benchmarks

Task	Dataset	Result
Visual Question Answering	VQA v2	--	1429
Instruction Following	AlpacaEval	Win Rate 92.7	423
Jailbreak Defense	AdvBench	ASR (PAIR) 0.00	115
Jailbreak Defense	HarmBench	PAIR ASR 1.9	91
Image Captioning	MS-COCO	CLIPScore 0.861	36
Image Classification	ImageNet-D	Top-1 Accuracy 65.7	36
Helpfulness evaluation	InstructionFollow	Accuracy 58.7	32
Defense against adaptive attacks	HarmBench	ASR 9.7	28
Jailbreak Defense	AdvBench LLaMA3-8B-instruct	Attack Success Rate (ASR) 6.7	7
Jailbreak Defense	AdvBench Mistral-7B v0.2	ASR 35.8	7

