Policy Evaluation on PolicyBench Level 1 (CN)
Top result: Deepseek R1, 62.02 Accuracy
[Chart: Accuracy by date; data point at Apr 14, 2026]
Updated 4d ago
Evaluation Results
Method        Model Variant   Date      Accuracy
Deepseek R1   -               2026.04   62.02
QwQ 32B       -               2026.04   55.87
Claude 3.7    Claude-3...     2026.04   55.29
Gemini 2.5    Gemini-2...     2026.04   54.06
Claude 3.5    Claude-3...     2026.04   53.77
LLaMA 4       -               2026.04   49.81
Deepseek V3   -               2026.04   48.61
Gemini 2.0    Gemini-2...     2026.04   47.87
GPT-4o        -               2026.04   46.01
o4-mini       -               2026.04   45.93
Gemma 3-27B   -               2026.04   41.75