
Single-pass Detection of Jailbreaking Input in Large Language Models

About

Defending aligned Large Language Models (LLMs) against jailbreaking attacks is a challenging problem: existing approaches require multiple requests, or even queries to auxiliary LLMs, making them computationally heavy. Instead, we focus on detecting jailbreaking input in a single forward pass. Our method, called Single Pass Detection (SPD), leverages the information carried by the logits to predict whether the output sentence will be harmful. This allows us to defend in just one forward pass. SPD not only detects attacks effectively on open-source models but also minimizes the misclassification of harmless inputs. Furthermore, we show that SPD remains effective even without complete logit access in GPT-3.5 and GPT-4. We believe our proposed method offers a promising approach to efficiently safeguard LLMs against adversarial attacks.
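The abstract describes predicting whether a generation will be harmful from the logits of a single forward pass. As an illustration only — the paper's actual feature set and classifier are not given on this page — the general idea can be sketched as follows. The `logit_features` summary statistics (top probability and entropy per position) and the linear classifier weights are hypothetical choices, not SPD's:

```python
import math

def logit_features(logits_per_position):
    """Turn per-position vocabulary logits from one forward pass into
    a flat feature vector (hypothetical feature choice: top softmax
    probability and entropy at each position)."""
    feats = []
    for logits in logits_per_position:
        # numerically stable softmax
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        probs = [e / z for e in exps]
        top = max(probs)
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        feats.extend([top, entropy])
    return feats

def flag_jailbreak(feats, weights, bias):
    """Linear classifier over the logit features; returns True if the
    input is flagged as a jailbreak attempt. Weights would be fitted on
    labeled harmful/harmless prompts in practice."""
    score = sum(w * f for w, f in zip(weights, feats)) + bias
    return score > 0

# Toy usage: a confidently peaked distribution vs. a flat one, with
# made-up weights that flag high-entropy (uncertain) positions.
peaked = logit_features([[5.0, 0.0, 0.0]])
flat = logit_features([[0.0, 0.0, 0.0]])
print(flag_jailbreak(peaked, [-1.0, 1.0], 0.0))  # not flagged
print(flag_jailbreak(flat, [-1.0, 1.0], 0.0))    # flagged
```

The key property being exploited is that all of this is computed from a single forward pass: no extra generations or auxiliary-model queries are needed.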

Leyla Naz Candogan, Yongtao Wu, Elias Abad Rocamora, Grigorios G. Chrysos, Volkan Cevher • 2025

Related benchmarks

Task                 Dataset  Metric                  Result  Rank
Jailbreak Detection  XST      Accuracy                70      13
Jailbreak Detection  AEG2     Accuracy                61.17   13
Jailbreak Detection  EJ-OO    Accuracy                68.33   13
Jailbreak Detection  FQ-PH    Accuracy                63.46   13
Jailbreak Detection  WJB      Accuracy                52.17   13
Jailbreak Detection  ALL-4    Accuracy                54.64   13
Jailbreak Detection  WGT      Accuracy                61.57   13
Jailbreak Detection  L3J      Accuracy                65.79   13
Jailbreak Detection  ADVB     Accuracy                74.23   13
Jailbreak Detection  HB       Correctness Rate (COR)  68      13
Showing 10 of 13 rows
