| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| DynaBench (test) | Activation-Space Whitening | F1 Score | 86 | 12 | 3mo ago |
| PhantomPolicy complete benchmark world-model coverage (human-reviewed trace labels) | Sentinel | True Positives (TP) | 58 | 5 | 1mo ago |
| PhantomPolicy safe-control original | | Violation Rate | 0.0333 | 5 | 1mo ago |
| PhantomPolicy original (violation-ground-truth) | | Violated Count | 54 | 5 | 1mo ago |