| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Agent Planning Security and Autonomy | WASP Reddit (test) | Attack Success Rate: 0 | 8 |
| Agent Planning Security and Autonomy | WASP GitLab (test) | Attack Success Rate: 29.2 | 8 |
| Computer-Using Agent Task | WASP 1.0 (test) | PCR: 97.6 | 5 |
| Label Prediction | WASP | Accuracy: 90.6 | 4 |