Detecting Language Model Attacks with Perplexity
About
A novel hack involving Large Language Models (LLMs) has emerged, exploiting adversarial suffixes to deceive models into generating perilous responses. Such jailbreaks can trick LLMs into providing intricate instructions to a malicious user for creating explosives, orchestrating a bank heist, or facilitating the creation of offensive content. By evaluating the perplexity of queries with adversarial suffixes using an open-source LLM (GPT-2), we found that they have exceedingly high perplexity values. As we explored a broad range of regular (non-adversarial) prompt varieties, we concluded that false positives are a significant challenge for plain perplexity filtering. A Light-GBM trained on perplexity and token length resolved the false positives and correctly detected most adversarial attacks in the test set.
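The detection signal described above can be sketched as follows. This is a minimal illustration of perplexity filtering, assuming access to per-token log-probabilities from a causal LM such as GPT-2; the probability values and the threshold are hypothetical, not numbers from the paper:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token.

    `token_log_probs` holds the natural-log probability a causal LM
    (e.g. GPT-2) assigns to each token given the preceding tokens.
    """
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Hypothetical log-probs: a fluent prompt gets high-probability tokens,
# while a gibberish adversarial suffix gets very low-probability ones.
fluent_lp = [math.log(p) for p in [0.20, 0.15, 0.30, 0.25, 0.18]]
suffix_lp = [math.log(p) for p in [1e-4, 5e-5, 2e-4, 1e-5, 8e-5]]

print(perplexity(fluent_lp))   # low perplexity: reads as natural text
print(perplexity(suffix_lp))   # very high perplexity: suspicious

# A plain filter flags any prompt above a tuned threshold; as the abstract
# notes, this alone produces many false positives on unusual benign prompts.
PPL_THRESHOLD = 1000.0  # illustrative value only
print(perplexity(suffix_lp) > PPL_THRESHOLD)  # flagged as adversarial
```

The key property exploited here is that GCG-style adversarial suffixes are optimized token sequences, not fluent language, so a language model assigns them very low probability and the resulting perplexity is orders of magnitude above that of ordinary prompts.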
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Instruction Following | MT-Bench | -- | -- | 189 |
| Mathematical Reasoning | GSM8K | EM | 87.8 | 115 |
| Jailbreak Defense | DeepInception | Harmful Score | 1.18 | 58 |
| Jailbreak Defense | AutoDAN | ASR | 2 | 51 |
| Jailbreak Defense | HarmBench and AdvBench (test) | GCG Score | 19.1 | 44 |
| Jailbreak Defense | GCG | Harmful Score | 1.02 | 37 |
| Jailbreak Defense | PAIR | Harmful Score | 1.18 | 37 |
| Prohibited Content Detection | ALERT | ASR | 14 | 34 |
| Jailbreak Detection | Base64 | Accuracy | 95 | 30 |
| Jailbreak Detection | DrAttack | Accuracy | 97 | 30 |
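The abstract's final step, a gradient-boosted classifier over perplexity and token length, can be sketched as below. This uses scikit-learn's `GradientBoostingClassifier` as a stand-in for Light-GBM, and the feature values are synthetic and purely illustrative, not the paper's data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic (log-perplexity, token-length) features, for illustration only:
# benign prompts cluster at low perplexity under the scoring LM, while
# prompts carrying adversarial suffixes are long and very high-perplexity.
n = 200
benign = np.column_stack([
    rng.normal(3.0, 1.0, n),   # log-perplexity around e^3 ~ 20
    rng.integers(5, 80, n),    # token length
])
adversarial = np.column_stack([
    rng.normal(9.0, 1.0, n),   # log-perplexity around e^9 ~ 8000
    rng.integers(40, 120, n),
])
X = np.vstack([benign, adversarial])
y = np.array([0] * n + [1] * n)  # 1 = adversarial

clf = GradientBoostingClassifier(random_state=0).fit(X, y)

# Score two hypothetical prompts as (log-perplexity, token length) pairs.
probe = np.array([[2.5, 30.0], [9.5, 90.0]])
print(clf.predict(probe))  # expect benign (0) then adversarial (1)
```

Adding token length as a second feature is what lets the classifier separate high-perplexity-but-benign inputs (e.g. short, unusual prompts) from genuine adversarial suffixes, which is how the paper's model resolves the false positives of plain perplexity thresholding.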