| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Harmful question-answering | BeaverTails HarmfulQA (1k and 10k samples) | Avg Harmfulness Score: 0 | 63 | |
| Malicious Fine-tuning Defense | BeaverTails (test) | Harmfulness Score: 1 | 44 | |
| Harmful score evaluation | BeaverTails (test) | Harmful Score: 28.4 | 36 | |
| Multi-label content safety classification | Beavertails | F1 Score: 0.86 | 35 | |
| Harmlessness evaluation | Beavertails | Helpful Score: 58.4 | 33 | |
| Safety | Beavertails | Violation Rate: 1.2 | 32 | |
| Response Classification | BeaverTails V Text-Image Response | F1 Score: 84.02 | 23 | |
| Adversarial and Jailbreaking Attack Detection | BeaverTails | AUROC: 0.8525 | 20 | |
| Text-based safety moderation | BeaverTails | F1 Score: 87.3 | 19 | |
| Response Harmfulness Detection | BeaverTails | F1 Score: 89.9 | 18 | |
| Safety Evaluation | BeaverTails Text | Overall Score: 99.3 | 16 | |
| Safety Evaluation | BeaverTails Audio 1K | RSR: 98.95 | 12 | |
| Prompt Classification | BeaverTails-V Text-Image Prompt | F1 Score: 88.36 | 7 | |
| Unsafe content categorization | BeaverTails V | Accuracy: 63.44 | 6 | |
| LLM Personalization | BeaverTails (test) | Personalization Win Rate: 72.1 | 6 | |
| Out-of-Taxonomy Risk Detection | BeaverTails V | F1 Score: 0.6314 | 4 | |
| Safety Evaluation | Beavertails-A Audio | Safety Score: 0.997 | 4 | |
| LLM Safety Defense | BeaverTails | ASR: 10.39 | 4 | |
| Safety Reasoning Evaluation | BeaverTails 5,000 prompts (subsampled) | Relevance: 4.68 | 2 | |
| OOD safety category inference (Stage 2) | BeaverTails V | Reward Mean: - | 0 | |