JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks

About

Systems and software powered by Large Language Models (LLMs) and Multi-Modal LLMs (MLLMs) play a critical role in numerous scenarios. However, current LLM systems are vulnerable to prompt-based attacks: jailbreaking attacks trick the LLM system into generating harmful content, while hijacking attacks manipulate it into performing attacker-desired tasks, underscoring the need for detection tools. Unfortunately, existing detection approaches are usually tailored to specific attacks, and therefore generalize poorly when detecting diverse attacks across different modalities. To address this, we propose JailGuard, a universal detection framework, deployed on top of LLM systems, for prompt-based attacks across text and image modalities. JailGuard operates on the principle that attack inputs are inherently less robust than benign inputs. Specifically, JailGuard mutates untrusted inputs to generate variants and leverages the discrepancy among the variants' responses on the target model to distinguish attack samples from benign samples. We implement 18 mutators for text and image inputs and design a mutator-combination policy to further improve detection generalization. Evaluation on a dataset containing 15 known attack types shows that JailGuard achieves a best detection accuracy of 86.14%/82.90% on text and image inputs, outperforming state-of-the-art methods by 11.81%-25.73% and 12.20%-21.40%, respectively.
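The mutate-and-compare principle described in the abstract can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: the two mutators below are toy word-level perturbations standing in for JailGuard's 18 mutators, the divergence metric is a symmetrised KL divergence over token frequencies chosen here for self-containment, and `query_model` is a hypothetical hook for the target LLM.

```python
import random
from collections import Counter
from math import log

def random_deletion(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Toy mutator: randomly drop a fraction of words from the prompt."""
    rng = random.Random(seed)
    kept = [w for w in text.split() if rng.random() > rate]
    return " ".join(kept) if kept else text

def random_swap(text: str, seed: int = 0) -> str:
    """Toy mutator: swap two randomly chosen words in the prompt."""
    rng = random.Random(seed)
    words = text.split()
    if len(words) >= 2:
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

def response_divergence(a: str, b: str) -> float:
    """Symmetrised, smoothed KL divergence between token-frequency
    distributions of two model responses (a stand-in metric)."""
    ca, cb = Counter(a.split()), Counter(b.split())
    vocab, eps = set(ca) | set(cb), 1e-9
    def dist(c):
        total = sum(c.values()) + eps * len(vocab)
        return {t: (c[t] + eps) / total for t in vocab}
    da, db = dist(ca), dist(cb)
    return 0.5 * sum(da[t] * log(da[t] / db[t]) + db[t] * log(db[t] / da[t])
                     for t in vocab)

def detect(prompt, query_model, mutators, threshold=0.5):
    """Flag the prompt as an attack if the target model's responses to
    mutated variants diverge more than `threshold` (attack inputs are
    assumed less robust to mutation than benign ones)."""
    responses = [query_model(m(prompt)) for m in mutators]
    divs = [response_divergence(ra, rb)
            for i, ra in enumerate(responses) for rb in responses[i + 1:]]
    max_div = max(divs) if divs else 0.0
    return max_div > threshold, max_div
```

A benign prompt tends to elicit semantically stable responses under mutation (low divergence, not flagged), while a carefully tuned attack prompt often breaks when perturbed, producing divergent responses that push the score over the threshold.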

Xiaoyu Zhang, Cen Zhang, Tianlin Li, Yihao Huang, Xiaojun Jia, Ming Hu, Jie Zhang, Yang Liu, Shiqing Ma, Chao Shen • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Safety Evaluation | MM-Safety | ASR | 20.83 | 57 |
| Toxicity Defense | MiniGPT-4 | Toxicity Score | 16.51 | 36 |
| Toxicity Defense | Qwen2-VL | Toxicity Score | 24.68 | 36 |
| Jailbreak Defense | Qwen2-VL | ASR | 15.45 | 36 |
| Jailbreak Defense | LLaVA v1.5 | ASR | 17.27 | 36 |
| Jailbreak Defense | MiniGPT-4 | ASR | 27.27 | 36 |
| Toxicity Defense | LLaVA v1.5 | Toxicity Score | 73.76 | 36 |
| Direct Malicious | VLSafe OOD | ASR | 72.43 | 16 |
| Text-based Jailbreak | AdvBench-M OOD | ASR (OOD) | 40.02 | 16 |
| Image-based Jailbreak | HADES OOD | ASR | 38.33 | 16 |
Showing 10 of 34 rows
