| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Jailbreak Attack | HarmBench | Attack Success Rate (ASR): 100 | 487 |
| Safety Evaluation | Harmbench | Harmbench Score: 0.06 | 112 |
| Safety Alignment | HarmBench | ASR: 0 | 88 |
| Jailbreaking | HARMBENCH 159 standard behaviors (test) | ASR: 0 | 51 |
| Jailbreak | HarmBench Standard Behaviours (200 examples) | ASR: 0 | 48 |
| Safety Evaluation | HarmBench | ASR: 4.1 | 42 |
| Refusal Ablation and Jailbreak Attack Success | HARMBENCH | Attack Success Rate (ASR): 96.27 | 40 |
| Controllability | HarmBench | HarmBench Score: 87.5 | 40 |
| Transferable Adversarial Attack | HarmBench Classifier (test) | TASR@1: 88.6 | 37 |
| Safety Evaluation | HarmBench | MD: 95 | 32 |
| Red-teaming Safety Evaluation | Harmbench | ASR: 0.3 | 32 |
| Jailbreaking | HarmBench (test) | ASR (GPT-4o): 97 | 27 |
| Red Teaming Attack | HarmBench (test) | ZS: 31.18 | 27 |
| Safety Moderation | Harmbench | F1 Score: 87.2 | 26 |
| Safety Evaluation | HarmBench Contextual Trajectory Evaluation Multi-turn | SFR: 96.1 | 24 |
| Adversarial Risk Estimation | HarmBench (test) | ASR@1000: 100 | 24 |
| Response Harmfulness Detection | HarmBench | F1 Score: 87.61 | 23 |
| Input Moderation | HarmBench (test) | F1 Score: 100 | 22 |
| Harmfulness Evaluation | HarmBench | Harmful Response Ratio: 21.26 | 21 |
| Adversarial and Jailbreaking Attack Detection | HarmBench | AUROC: 0.8887 | 20 |
| Jailbreaking | HarmBench 51 (test) | ASR@5 (Standard): 98.1 | 19 |
| Safety Alignment | HarmBench | MD Score: 95 | 18 |
| Jailbreak Attack Evaluation | HarmBench (400 random samples) | ASR: 0 | 18 |
| Jailbreak Attack | HarmBench example-based Llama3 8B | Attack Success Rate: 5 | 17 |
| Safety Evaluation | HarmBench translated (test) | Success Rate (EN): 11 | 16 |