| Dataset Name | SOTA Method | Metric | Trend | Updated |
|---|---|---|---|---|
| ImpossibleBench | | Average Precision 100 | 40 | 4d ago |
| TRACE | | Average Precision 90.1 | 28 | 4d ago |
| BigCodeBench-Sabotage (reasoning LLM attacker) | Extract-and-Evaluate | log-AUROC 0.87 | 8 | 1mo ago |
| BigCodeBench-Sabotage (traditional LLM attacker) | Extract-and-Evaluate | log-AUROC 0.84 | 8 | 1mo ago |
| MLE-Sabotage | Action-only | log-AUROC 0.87 | 8 | 1mo ago |
| SHADE-Arena | CoT+action | log-AUROC 78 | 8 | 1mo ago |