
Single-pass Detection of Jailbreaking Input in Large Language Models

About

Defending aligned Large Language Models (LLMs) against jailbreaking attacks is a challenging problem: existing approaches require multiple requests, or even queries to auxiliary LLMs, making them computationally heavy. Instead, we focus on detecting jailbreaking input in a single forward pass. Our method, called Single Pass Detection (SPD), leverages the information carried by the logits to predict whether the output sentence will be harmful. This allows us to defend in just one forward pass. SPD not only detects attacks effectively on open-source models but also minimizes the misclassification of harmless inputs. Furthermore, we show that SPD remains effective even without complete logit access in GPT-3.5 and GPT-4. We believe our proposed method offers a promising approach to efficiently safeguard LLMs against adversarial attacks.
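The abstract describes predicting whether a generation will be harmful from the logits of a single forward pass. As an illustration only — the paper's actual feature set and classifier are not given on this page — the general idea can be sketched as follows. The `logit_features` summary statistics (top probability and entropy per position) and the linear classifier weights are hypothetical choices, not SPD's:

```python
import math

def logit_features(logits_per_position):
    """Turn per-position vocabulary logits from one forward pass into
    a flat feature vector (hypothetical feature choice: top softmax
    probability and entropy at each position)."""
    feats = []
    for logits in logits_per_position:
        # numerically stable softmax
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        probs = [e / z for e in exps]
        top = max(probs)
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        feats.extend([top, entropy])
    return feats

def flag_jailbreak(feats, weights, bias):
    """Linear classifier over the logit features; returns True if the
    input is flagged as a jailbreak attempt. Weights would be fitted on
    labeled harmful/harmless prompts in practice."""
    score = sum(w * f for w, f in zip(weights, feats)) + bias
    return score > 0

# Toy usage: a confidently peaked distribution vs. a flat one, with
# made-up weights that flag high-entropy (uncertain) positions.
peaked = logit_features([[5.0, 0.0, 0.0]])
flat = logit_features([[0.0, 0.0, 0.0]])
print(flag_jailbreak(peaked, [-1.0, 1.0], 0.0))  # not flagged
print(flag_jailbreak(flat, [-1.0, 1.0], 0.0))    # flagged
```

The key property being exploited is that all of this is computed from a single forward pass: no extra generations or auxiliary-model queries are needed.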

Leyla Naz Candogan, Yongtao Wu, Elias Abad Rocamora, Grigorios G. Chrysos, Volkan Cevher • 2025

Related benchmarks

Task                 Dataset  Metric                  Result  Rank
Jailbreak Detection  XST      Accuracy                70      13
Jailbreak Detection  AEG2     Accuracy                61.17   13
Jailbreak Detection  EJ-OO    Accuracy                68.33   13
Jailbreak Detection  FQ-PH    Accuracy                63.46   13
Jailbreak Detection  WJB      Accuracy                52.17   13
Jailbreak Detection  ALL-4    Accuracy                54.64   13
Jailbreak Detection  WGT      Accuracy                61.57   13
Jailbreak Detection  L3J      Accuracy                65.79   13
Jailbreak Detection  ADVB     Accuracy                74.23   13
Jailbreak Detection  HB       Correctness Rate (COR)  68      13
Showing 10 of 13 rows
