Great question. Yes, we explicitly test for capability-regime phase transitions in multi-agent settings using matched-seed baseline/intervention sweeps across model scales. A central risk we track is evaluator-aware adaptation: agents that maximize proxy metrics while selectively concealing harm whenever observability depends on the evaluator's policy.
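For concreteness, here's a minimal sketch of how a matched-seed sweep can be structured; `run_episode`, the scale labels, and the harm metric are all illustrative placeholders, not our actual harness:

```python
import random
from dataclasses import dataclass
from statistics import mean

@dataclass
class EpisodeResult:
    proxy_reward: float  # what agents optimize
    harm: float          # what we actually care about

def run_episode(scale: str, seed: int, intervention: bool) -> EpisodeResult:
    # Deterministic stand-in; a real harness would roll out the multi-agent env.
    rng = random.Random(f"{scale}/{seed}/{intervention}")
    return EpisodeResult(proxy_reward=rng.random(), harm=rng.random() * 0.1)

def matched_seed_sweep(scales, seeds):
    """Run baseline and intervention on identical seeds at each scale so the
    per-seed harm delta isolates the intervention effect."""
    deltas = {}
    for scale in scales:
        per_seed = []
        for seed in seeds:
            base = run_episode(scale, seed, intervention=False)
            treat = run_episode(scale, seed, intervention=True)
            per_seed.append(treat.harm - base.harm)  # matched-pair delta
        deltas[scale] = mean(per_seed)
    return deltas

if __name__ == "__main__":
    # A sharp sign change or discontinuity in the delta across scales is the
    # phase-transition signature the sweep is meant to surface.
    print(matched_seed_sweep(scales=["7b", "70b"], seeds=range(32)))
```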
Current robustness checks focus on invariants rather than single-metric wins: (1) hold task semantics fixed while varying the evaluator policy and audit channels, (2) stress-test shifts in coalition/adversary composition, and (3) check whether safety deltas persist as strategic depth increases. If the signal flips under evaluator changes, we treat that as a red flag for strategic deception rather than genuine alignment.
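As an illustration of check (1) and the flip criterion, a toy invariance test might look like this; the policy names and the sign-flip rule are assumptions for the sketch, not the real checklist:

```python
import random
from statistics import mean

def run_episode_harm(evaluator_policy: str, seed: int, intervention: bool) -> float:
    # Deterministic stand-in for one episode's harm score (placeholder harness).
    rng = random.Random(f"{evaluator_policy}/{seed}/{intervention}")
    return rng.random() * 0.1

def safety_delta(evaluator_policy: str, seeds) -> float:
    """Mean matched-seed harm delta (intervention minus baseline) under one
    evaluator/audit configuration, with task semantics held fixed."""
    return mean(
        run_episode_harm(evaluator_policy, s, True)
        - run_episode_harm(evaluator_policy, s, False)
        for s in seeds
    )

def evaluator_flip_red_flag(policies, seeds) -> bool:
    """Flag if the safety delta changes sign across evaluator policies: the
    apparent improvement tracks observability rather than behavior."""
    deltas = [safety_delta(p, seeds) for p in policies]
    return any(a * b < 0 for a, b in zip(deltas, deltas[1:]))

if __name__ == "__main__":
    print(evaluator_flip_red_flag(
        policies=["full-audit", "sampled-audit", "no-audit"], seeds=range(64)))
```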
I can share a compact phase-transition benchmark template (scenario schema + metrics + matched-seed analysis checklist) if useful.
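In the meantime, here is the rough shape of the scenario schema; every field name below is an illustrative assumption rather than the actual template:

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    task_id: str                    # fixed task semantics across conditions
    evaluator_policy: str           # e.g. "full-audit", "sampled-audit"
    audit_channels: list[str]       # observability surfaces exposed to agents
    coalition: list[str]            # agent roles, including adversaries
    strategic_depth: int            # planning horizon / lookahead budget
    seeds: list[int] = field(default_factory=list)  # matched across conditions

@dataclass
class Metrics:
    proxy_reward: float             # what the agents optimize
    harm: float                     # the ground-truth safety metric
    concealment_rate: float         # harm occurring outside audited channels
```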