| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Harmful question-answering | BeaverTails HarmfulQA (1k and 10k samples) | Avg Harmfulness Score | 0 | 63 |
| Text-based safety moderation | BeaverTails | F1 Score | 87.3 | 46 |
| Malicious Fine-tuning Defense | BeaverTails (test) | Harmfulness Score | 1 | 44 |
| Response Classification | BeaverTails V Text-Image Response | F1 Score | 84.8 | 39 |
| Harmful score evaluation | BeaverTails (test) | Harmful Score | 28.4 | 36 |
| Multi-label content safety classification | Beavertails | F1 Score | 0.86 | 35 |
| Harmlessness evaluation | Beavertails | Helpful Score | 58.4 | 33 |
| Safety | Beavertails | Violation Rate | 1.2 | 32 |
| Multimodal Safety Evaluation | BeaverTails-V | Safety Score | 2.9 | 22 |
| Safety Evaluation | Beavertails-V (test) | Helpfulness Score | 86.93 | 20 |
| Safety Evaluation | BeaverTails Evaluation | Harmful Score (HS) | 1.24 | 20 |
| Adversarial and Jailbreaking Attack Detection | BeaverTails | AUROC | 0.8525 | 20 |
| Response Harmfulness Detection | BeaverTails | F1 Score | 89.9 | 18 |
| Safety Evaluation | BeaverTails Text | Overall Score | 99.3 | 16 |
| Safety Evaluation | BeaverTails (test) | Blue Reward (Higher) | 98 | 15 |
| Safety Alignment | BeaverTails V | Safety Score | 93.37 | 13 |
| Safety Evaluation | BeaverTails Audio 1K | RSR | 98.95 | 12 |
| Tracing | Beavertails (test) | Tracing Success Rate (TSR) | 99.29 | 10 |
| Prompt Classification | BeaverTails-V Text-Image Prompt | F1 Score | 88.36 | 7 |
| Unsafe content categorization | BeaverTails V | Accuracy | 63.44 | 6 |
| LLM Personalization | BeaverTails (test) | Personalization Win Rate | 72.1 | 6 |
| Safety Monitoring | BeaverTails | ASR @ 1% FPR | 19.6 | 5 |
| Safety Evaluation | BeaverTails | Safety Score | 76.7 | 4 |
| Out-of-Taxonomy Risk Detection | BeaverTails V | F1 Score | 0.6314 | 4 |
| Safety Evaluation | Beavertails-A Audio | Safety Score | 0.997 | 4 |