Jailbreaking Detection on Safety Evaluation Set Jailbreaking (held-out target labels)
Figure: AUROC progress chart (best: Safety Monitor (In-model), 97.4 AUROC; dated May 12, 2026).
Evaluation Results

Method                         Settings                   Date     AUROC
Safety Monitor (In-model)      Model=Qwen-2.5-7B-Inst...  2026.05  97.4
Quotient Transfer              Target Model=Qwen-3B,...   2026.05  93.2
Quotient Transfer              Target Model=Qwen-14B,...  2026.05  92.6
Quotient Transfer              Target Model=Qwen-2.5-...  2026.05  83.8
Quotient Transfer              Target Model=Mistral,...   2026.05  66.9
Random-initialization control  -                          2026.05  52.2
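The leaderboard reports AUROC without saying how it is computed. As a minimal sketch, assuming each prompt receives a real-valued detector score and a binary label (1 = jailbreak, 0 = benign), the function below computes AUROC from its rank-statistic definition: the probability that a randomly chosen jailbreak scores higher than a randomly chosen benign prompt. The name `auroc` and the toy data are hypothetical, and the table's values appear to be AUROC scaled to 0-100. Under that reading, the random-initialization control's 52.2 sits close to the 50-point chance level.

```python
import random


def auroc(scores, labels):
    """Return AUROC in [0, 1] for binary labels (1 = jailbreak, 0 = benign).

    Rank-statistic definition: the probability that a randomly chosen
    positive example gets a higher score than a randomly chosen negative
    one, with ties counted as half. The pairwise loop is O(P * N), which
    is fine for small illustrative sets.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = [1, 1, 1, 0, 0, 0]
    # Hypothetical detector that ranks most, but not all, jailbreaks above benign prompts.
    scores = [0.9, 0.4, 0.7, 0.6, 0.2, 0.1]
    print(round(100 * auroc(scores, labels), 1))  # 88.9 (8 of 9 pairs ordered correctly)

    # Scores unrelated to the labels, loosely analogous to a randomly
    # initialized monitor: AUROC lands near the 50-point chance level.
    rng = random.Random(0)
    rand_labels = [1] * 500 + [0] * 500
    rand_scores = [rng.random() for _ in rand_labels]
    print(round(100 * auroc(rand_scores, rand_labels), 1))
```

For real evaluations, a sorting-based O(n log n) implementation such as scikit-learn's `roc_auc_score` is the usual choice. The pairwise loop here is only meant to make the definition readable.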