| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Jailbreak Robustness | WildTeaming WJ | Evaluation Score (avg@4): 95.1 | 18 |
| Safety Evaluation | WildTeaming | H: 89 | 16 |
| Safety and Helpfulness Evaluation | WildTeaming | Harm Rate: 0.3 | 15 |
| Safety Evaluation | WildTeaming 500-example (test) | HarmR: 88.6 | 14 |
| LLM Red-Teaming | WildTeaming Target: Mistral | ASR: 55.5 | 2 |
| LLM Red-Teaming | WildTeaming Llama-3 | ASR: 25.4 | 2 |
| LLM Red-teaming | WildTeaming Llama-3.3-70B target | Effectiveness Count: 3,711 | 2 |
| LLM Red-teaming | WildTeaming Mistral-7B target | Effectiveness Query Count: 5,414 | 2 |