| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Policy Violation Detection | PhantomPolicy complete benchmark world-model coverage (human-reviewed trace labels) | True Positives (TP): 58 | 5 |
| Policy Violation Detection | PhantomPolicy safe-control original | Violation Rate: 0.0333 | 5 |
| Policy Violation Detection | PhantomPolicy original (violation-ground-truth) | Violated Count: 54 | 5 |
| Policy Enforcement | PhantomPolicy (human-reviewed trace labels) | Risky-case Violation Rate: 40.7 | 2 |