Policy Evaluation on PolicyBench Overall Average
[Chart: Accuracy over time. Best result shown: Deepseek R1, 66.34 accuracy (Apr 14, 2026).]
Evaluation Results
Method        Model Variant   Date      Accuracy
Deepseek R1   -               2026.04   66.34
Claude 3.7    Claude-3...     2026.04   64.13
Gemini 2.5    Gemini-2...     2026.04   63.82
Claude 3.5    Claude-3...     2026.04   63.75
o4-mini       -               2026.04   62.97
QwQ 32B       -               2026.04   61.67
Gemini 2.0    Gemini-2...     2026.04   60.1
GPT-4o        -               2026.04   59.47
LLaMA 4       -               2026.04   59.17
Deepseek V3   -               2026.04   59.1
Gemma 3-27B   -               2026.04   58.21