GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis

About

Large Language Models (LLMs) face threats from jailbreak prompts. Existing methods for detecting jailbreak prompts are primarily online moderation APIs or finetuned LLMs. These strategies, however, often require extensive and resource-intensive data collection and training processes. In this study, we propose GradSafe, which effectively detects jailbreak prompts by scrutinizing the gradients of safety-critical parameters in LLMs. Our method is grounded in a pivotal observation: the gradients of an LLM's loss for jailbreak prompts paired with a compliance response exhibit similar patterns on certain safety-critical parameters. In contrast, safe prompts lead to different gradient patterns. Building on this observation, GradSafe analyzes the gradients from prompts (paired with compliance responses) to accurately detect jailbreak prompts. We show that GradSafe, applied to Llama-2 without further training, outperforms Llama Guard in detecting jailbreak prompts, even though Llama Guard was extensively finetuned on a large dataset. This superior performance is consistent across both zero-shot and adaptation scenarios, as evidenced by our evaluations on ToxicChat and XSTest. The source code is available at https://github.com/xyq7/GradSafe.
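
The gradient comparison described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the repository linked above for that). It assumes that "similar patterns" is measured with row-wise cosine similarity against a reference gradient averaged over a few known unsafe prompts, and that safety-critical parameters are the rows where unsafe gradients agree with that reference much more than safe gradients do. The function names (`row_cosine`, `select_critical_rows`, `unsafe_score`), the `gap` threshold, and the synthetic arrays are all hypothetical. With a real model, each array would instead be the gradient of the loss for a prompt paired with a compliance response.

```python
import numpy as np

def row_cosine(a, b, eps=1e-12):
    """Cosine similarity between matching rows of two (rows, cols) arrays."""
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return num / den

def select_critical_rows(unsafe_grads, safe_grads, gap=0.7):
    """Assumed selection rule: keep rows where unsafe gradients match the
    unsafe reference far better than safe gradients do."""
    ref = np.mean(unsafe_grads, axis=0)
    unsafe_sim = np.mean([row_cosine(g, ref) for g in unsafe_grads], axis=0)
    safe_sim = np.mean([row_cosine(g, ref) for g in safe_grads], axis=0)
    rows = np.flatnonzero(unsafe_sim - safe_sim > gap)
    return ref, rows

def unsafe_score(grad, ref, rows):
    """Mean cosine similarity to the reference on the selected rows only."""
    return float(row_cosine(grad[rows], ref[rows]).mean())

# Synthetic stand-in for real gradients: "unsafe" samples share a fixed
# pattern on three rows, everything else is independent noise.
rng = np.random.default_rng(0)
n_rows, n_cols = 8, 64
pattern = rng.normal(size=(n_rows, n_cols))
critical = np.array([1, 4, 6])

def toy_gradient(unsafe):
    g = rng.normal(size=(n_rows, n_cols))
    if unsafe:
        noise = 0.1 * rng.normal(size=(len(critical), n_cols))
        g[critical] = pattern[critical] + noise
    return g

unsafe_refs = [toy_gradient(True) for _ in range(8)]
safe_refs = [toy_gradient(False) for _ in range(8)]
ref, rows = select_critical_rows(unsafe_refs, safe_refs)

score_unsafe = unsafe_score(toy_gradient(True), ref, rows)
score_safe = unsafe_score(toy_gradient(False), ref, rows)
print("selected rows:", rows.tolist())
print("unsafe score:", round(score_unsafe, 3), "safe score:", round(score_safe, 3))
```

A prompt would then be flagged when its score exceeds a chosen threshold. The threshold, the number of reference prompts, and the granularity of the parameter slices are design choices this sketch does not take from the source.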

Yueqi Xie, Minghong Fang, Renjie Pi, Neil Gong • 2024

Related benchmarks

Task	Dataset	Metric	Result
Safety Evaluation	HarmBench	ASR	7.5	148
Safety Evaluation	DirectHarm4	Attack Success Rate	9	87
Safety Evaluation	HEX-PHI	Attack Success Rate (ASR)	5.88	87
Jailbreak attack success rate	HarmBench	Attack Success Rate (Generated)	57	52
Attack Success Rate	DirectHarm4	Attack Success Rate	65	48
Attack Success Rate	HEX-PHI	Attack Success Rate	13.45	48
Safety Evaluation	HarmBench	ASR	17.5	39
Jailbreak Detection	Average of six attacks	Avg Success Rate	0.00e+0	38
Adversarial and Jailbreaking Attack Detection	XSTest	AUROC	0.984	35
Safety Guardrailing	HumanEval	False Positive Rate	0.00e+0	32

Showing 10 of 63 rows
