| Dataset Name | SOTA Method | Metric | Score | Trend | Updated |
|---|---|---|---|---|---|
| BigCodeBench-Sabotage (reasoning LLM attacker) | Extract-and-Evaluate | log-AUROC | 0.87 | 8 | 4d ago |
| BigCodeBench-Sabotage (traditional LLM attacker) | Extract-and-Evaluate | log-AUROC | 0.84 | 8 | 4d ago |
| MLE-Sabotage | Action-only | log-AUROC | 0.87 | 8 | 4d ago |
| SHADE-Arena | CoT+action | log-AUROC | 78 | 8 | 4d ago |