| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Jailbreak Robustness | WildJailbreak | Unsafe Rate | 0 | 144 |
| Safety Evaluation | WildJailbreak | ASR | 0.068 | 70 |
| Safety Evaluation | WildJailbreak (held-out) | Attack Success Rate (ASR) | 0 | 50 |
| Over-refusal | WildJailbreak (Benign) | Benign Refusal Rate | 1.43 | 49 |
| Safety Alignment | WildJailbreak | Trainable parameters (M) | 15,768.31 | 44 |
| Jailbreak | WildJailbreak (WJB) (test) | ASR@1 | 2 | 33 |
| Safety Alignment | WildJailbreak | Safe@1 | 77.2 | 24 |
| Jailbreak Evaluation | WildJailbreak | Performance Rate | 98.5 | 22 |
| Safety | WildJailbreak | Harmful Response Ratio | 17.6 | 21 |
| Safety Moderation | WildJailbreak (val) | ASR | 0.7 | 18 |
| Jailbreak Safety | WildJailbreak | Reasoning Harmful Ratio | 17.3 | 17 |
| Jailbreak Detection | WildJailbreak | F1 Score | 96 | 15 |
| Safety Alignment Evaluation | WildJailbreak (WildJB) | Safety Rate | 98.6 | 14 |
| Text Moderation | WildJailbreak Adv. Benign (n = 210) | Flagged Count | 3 | 13 |
| Text Moderation | WildJailbreak Adv. Harmful (n = 2,000) | Flagged Count | 278 | 13 |
| Adversarial Robustness | WildJailbreak 2k queries (test) | Number of Explicit Refusals | 498 | 12 |
| Attacking Cascade System Decision Maker | WildJailbreak | Performance | 98.5 | 11 |
| Jailbreaking | WildJailbreak (WJB) | ASR@1 (Qwen2.5-7B-IT) | 89.5 | 11 |
| Jailbreak Attack | WildJailbreak | ASR | 0.8 | 11 |
| Safety Performance | WildJailbreak | Selective Refusal Score (Δs) | 90.1 | 11 |
| Benign Compliance | WildJailbreak (WJB) Vanilla Benign | Compliance Rate | 100 | 9 |
| Refusal Evaluation | WildJailbreak Adversarial Harmful | Refusal Rate | 89.45 | 7 |
| Jailbreaking Attack Detection | WildJailbreak | Accuracy (MCA) | 36 | 6 |
| Safety Detection | WildJailbreak (held-out) | AUROC | 99 | 5 |
| Jailbreak | WildJailbreak Forbidden Questions (Overall) | ASR | 92.1 | 2 |
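Several rows report raw counts rather than rates, and the ASR rows do not state their scale (0.068 and 92.1 are unlikely to be on the same one). The minimal Python sketch below shows how a count on a split of known size turns into a percentage, and how a rate-style metric such as ASR is typically computed from per-prompt judgements. The helper names (`CountResult`, `attack_success_rate`) are hypothetical, and the per-prompt judgements are assumed to come from some external harmfulness judge that this table does not specify. The example inputs reuse the Text Moderation counts from the table; on the benign split, flags are presumably false positives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CountResult:
    """Hypothetical helper: a raw count reported on a split of known size."""
    split: str
    flagged: int
    total: int

    def rate_percent(self) -> float:
        """Share of prompts that were flagged, as a percentage."""
        if self.total <= 0:
            raise ValueError("total must be positive")
        return 100.0 * self.flagged / self.total


def attack_success_rate(judgements: list[bool]) -> float:
    """Hypothetical ASR in percent: share of attack prompts judged successful.

    Each entry is assumed to be True when an external judge marked the
    model's response to that attack prompt as harmful compliance.
    """
    if not judgements:
        raise ValueError("need at least one judgement")
    return 100.0 * sum(judgements) / len(judgements)


if __name__ == "__main__":
    # Counts and split sizes taken from the Text Moderation rows above.
    rows = [
        CountResult("WildJailbreak Adv. Benign", flagged=3, total=210),
        CountResult("WildJailbreak Adv. Harmful", flagged=278, total=2000),
    ]
    for row in rows:
        print(f"{row.split}: {row.rate_percent():.2f}% flagged")
```

Running it prints 1.43% for the adversarial benign split and 13.90% for the adversarial harmful split. Converting counts to rates this way makes the two moderation rows comparable with each other, but comparing any row against another leaderboard entry still requires checking which scale and which judge the original source used.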