| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Jailbreak Attack | HarmBench | Attack Success Rate (ASR) | 100 | 376 |
| Safety Alignment | HarmBench | ASR | 0 | 88 |
| Safety Evaluation | Harmbench | Harmbench Score | 0.06 | 76 |
| Jailbreaking | HARMBENCH 159 standard behaviors (test) | ASR | 0 | 51 |
| Jailbreak | HarmBench Standard Behaviours (200 examples) | ASR | 0 | 48 |
| Controllability | HarmBench | HarmBench Score | 87.5 | 40 |
| Transferable Adversarial Attack | HarmBench Classifier (test) | TASR@1 | 88.6 | 37 |
| Red-teaming Safety Evaluation | Harmbench | ASR | 0.3 | 32 |
| Adversarial Risk Estimation | HarmBench (test) | ASR@1000 | 100 | 24 |
| Response Harmfulness Detection | HarmBench | F1 Score | 87.61 | 23 |
| Harmfulness Evaluation | HarmBench | Harmful Response Ratio | 21.26 | 21 |
| Adversarial and Jailbreaking Attack Detection | HarmBench | AUROC | 0.8887 | 20 |
| Jailbreaking | HarmBench 51 (test) | ASR@5 (Standard) | 98.1 | 19 |
| Safety Alignment | HarmBench | MD Score | 95 | 18 |
| Jailbreak Attack Evaluation | HarmBench (400 random samples) | ASR | 0 | 18 |
| Coherence evaluation | HarmBench (test) | N-gram Repetition Rate | 0.22 | 16 |
| Malicious Prompt Refusal | HarmBench | Refusal Rate | 96 | 15 |
| Prompt Classification | HarmBench Text Prompt | F1 Score | 98.85 | 14 |
| Jailbreak Attack | Harmbench Malicious (full) | Harmful Score | 1.13 | 14 |
| Safety Classification | HarmBench | Recall | 100 | 14 |
| Adversarial Robustness | HarmBench | DR | 56.25 | 12 |
| Prompt-Response Safety Routing | HarmBench | Routing F1 | 55.92 | 10 |
| Safety Classification | HarmBench (test) | F1 Score | 90.5 | 9 |
| Safety Evaluation | HarmBench Hades | Safety Score (1-ASR) | 1 | 8 |
| Safety Evaluation | HarmBench-QR | Safety Score (1-ASR) | 0.9975 | 8 |