Policy Evaluation on PolicyBench Level 2 (CN)
[Chart: Accuracy by date. Leading result: Deepseek R1, 62.92 accuracy, Apr 14, 2026.]
Evaluation Results
| Method | Details | Date | Accuracy |
|---|---|---|---|
| Deepseek R1 | | 2026.04 | 62.92 |
| Gemini 2.5 | Model Variant=Gemini-2... | 2026.04 | 60.57 |
| Claude 3.7 | Model Variant=Claude-3... | 2026.04 | 60.47 |
| QwQ 32B | | 2026.04 | 59.79 |
| Claude 3.5 | Model Variant=Claude-3... | 2026.04 | 59.74 |
| LLaMA 4 | | 2026.04 | 56.56 |
| Gemini 2.0 | Model Variant=Gemini-2... | 2026.04 | 56.39 |
| GPT-4o | | 2026.04 | 56.34 |
| o4-mini | | 2026.04 | 55.81 |
| Gemma 3-27B | | 2026.04 | 55.56 |
| Deepseek V3 | | 2026.04 | 55.51 |