Policy Question Answering on PolicyBench US
Chart: Accuracy by date (x-axis tick: Apr 14, 2026). Best result: Deepseek-R1, 66.43.
Evaluation Results
Method               Date      Accuracy
Deepseek-R1          2026.04   66.43
o4-mini              2026.04   65.54
Claude-3.5-Sonnet    2026.04   65.39
Claude-3.7-sonnet    2026.04   65.06
Gemini-2.5-Flash     2026.04   64.03
GPT-4o               2026.04   61.41
Gemini-2.0-Flash     2026.04   60.84
Gemma 3-27B          2026.04   60.15
LLaMA-4              2026.04   60.04
Deepseek-V3          2026.04   59.38
QwQ-32B              2026.04   58