| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Safety Evaluation | ATBench | Accuracy 78.4 | 25 |
| Trajectory-level safety evaluation | ATBench (test) | Accuracy 0.928 | 20 |
| Fine-grained risk diagnosis | ATBench | Risk Source Score 75.2 | 19 |
| Trajectory-safety classification | ATBench-C | Accuracy 77.8 | 18 |
| Safety Detection | ATBench-500 | Accuracy 90 | 14 |
| Trajectory-safety diagnosis | ATBench-F | R.S. Score 49.2 | 14 |
| Agent Safety Auditing | ATBench | Accuracy 85.5 | 13 |
| Real-world Harm Prediction | ATBench | Accuracy 39 | 10 |
| Failure Mode Prediction | ATBench | Accuracy 41 | 10 |
| Risk Source Prediction | ATBench | Accuracy 52 | 10 |
| Classification | ATBench (label-stratified) | AUROC 0.784 | 4 |
| Attack Detection | ATBench (label-stratified) | AUROC 0.762 | 1 |
| Safety Evaluation and Alignment | ATBench Family | Metric - | 0 |