| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Safety Evaluation | BeaverTails (test) | Harmful Score | 7.9 | 110 |
| Harmfulness Detection | BeaverTails bottom 30% uncertainty slice (test) | AUROC | 85.1 | 70 |
| Safe-or-harmful binary classification | BeaverTails | Accuracy | 84.6 | 63 |
| Multi-risk safety monitoring | BeaverTails | Accuracy (%) | 80.1 | 63 |
| Harmful question-answering | BeaverTails HarmfulQA (1k and 10k samples) | Avg Harmfulness Score | 0 | 63 |
| Response Harmfulness Detection | BeaverTails | F1 Score | 89.9 | 59 |
| Harmful score evaluation | BeaverTails (test) | Harmful Score | 28.4 | 52 |
| Text-based safety moderation | BeaverTails | F1 Score | 87.3 | 46 |
| Malicious Fine-tuning Defense | BeaverTails (test) | Harmfulness Score | 1 | 44 |
| Response Classification | BeaverTails V Text-Image Response | F1 Score | 84.8 | 39 |
| Multi-label content safety classification | Beavertails | F1 Score | 0.86 | 35 |
| Harmlessness evaluation | Beavertails | Helpful Score | 58.4 | 33 |
| Safety | Beavertails | Violation Rate | 1.2 | 32 |
| Adversarial Attack Robustness | BeaverTails | Attack Success Rate | 0 | 24 |
| Value Alignment | BeaverTails (test) | Value Alignment Score | 59.9 | 24 |
| Safety classification | BeaverTails (test) | AUC | 94 | 24 |
| Multimodal Safety Evaluation | BeaverTails-V | Safety Score | 2.9 | 22 |
| Safety Evaluation | Beavertails-V (test) | Helpfulness Score | 86.93 | 20 |
| Safety Evaluation | BeaverTails Evaluation | Harmful Score (HS) | 1.24 | 20 |
| Adversarial and Jailbreaking Attack Detection | BeaverTails | AUROC | 0.8525 | 20 |
| Safety Evaluation | BeaverTails | ASR | 8.8 | 19 |
| LLM Safety Evaluation | BeaverTails | Ssafe Score | 9.88 | 18 |
| Safety and Helpfulness Evaluation | Beavertails | Safety Score | 92.03 | 18 |
| Safety Evaluation | BeaverTails Text | Overall Score | 99.3 | 16 |
| Safety Evaluation | BeaverTails (test) | Blue Reward (Higher) | 98 | 15 |