Policy Violation Detection on PhantomPolicy original (violation-ground-truth)
[Chart: Violated Count and Self-Avoided Count. Highlighted point: Claude Sonnet 4.6, Violated Count 54, Apr 14, 2026]
Updated 1mo ago
Evaluation Results
| Method | Condition | Date | Violated Count | Self-Avoided Count |
|---|---|---|---|---|
| Claude Sonnet 4.6 | Condition=Execution-or... | 2026.04 | 54 | 6 |
| Claude Opus 4.6 | Condition=Execution-or... | 2026.04 | 56 | 4 |
| GPT-5.4 | Condition=Execution-or... | 2026.04 | 58 | 2 |
| GPT-5 mini | Condition=Execution-or... | 2026.04 | 59 | 1 |
| GPT-5.4 nano | Condition=Execution-or... | 2026.04 | 59 | 1 |
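In every row above, Violated Count and Self-Avoided Count add up to 60. The sketch below copies the five rows and computes each method's self-avoided share, assuming the two counts split one common set of cases; the page does not state this, so treat the share as illustrative only. The names `ROWS` and `summarize` are hypothetical and not part of the benchmark.

```python
# Rows copied from the Evaluation Results table: (method, violated, self_avoided).
ROWS = [
    ("Claude Sonnet 4.6", 54, 6),
    ("Claude Opus 4.6", 56, 4),
    ("GPT-5.4", 58, 2),
    ("GPT-5 mini", 59, 1),
    ("GPT-5.4 nano", 59, 1),
]


def summarize(rows):
    """Return (method, total, self_avoided_share) tuples, lowest Violated Count first.

    ASSUMPTION: violated + self_avoided is the full set of cases for each method.
    The source page does not confirm this interpretation.
    """
    summary = []
    for method, violated, self_avoided in sorted(rows, key=lambda r: r[1]):
        total = violated + self_avoided
        summary.append((method, total, self_avoided / total))
    return summary


if __name__ == "__main__":
    for method, total, share in summarize(ROWS):
        print(f"{method:<18} total={total}  self-avoided share={share:.1%}")
```

Sorting by Violated Count keeps the same order as the table, since Python's sort is stable and the two methods tied at 59 stay in their listed order.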