Policy Evaluation on PolicyBench Level 3 US
[Figure: Accuracy chart. Highlighted point: o4-mini, accuracy 77, Apr 14, 2026.]
Evaluation Results
Method        Model Variant   Date      Accuracy
o4-mini       -               2026.04   77
Deepseek R1   -               2026.04   74.6
QwQ 32B       -               2026.04   69.9
Gemini 2.5    Gemini-2...     2026.04   69.44
Deepseek V3   -               2026.04   69.39
Claude 3.5    Claude-3...     2026.04   68.47
Gemma 3-27B   -               2026.04   68.37
Claude 3.7    Claude-3...     2026.04   68.28
GPT-4o        -               2026.04   68.13
Gemini 2.0    Gemini-2...     2026.04   66.55
LLaMA 4       -               2026.04   66.41