Policy Evaluation on PolicyBench Level 3 CN
[Chart: Accuracy over time. Best result: QwQ 32B, 80.34 (Apr 14, 2026).]
Evaluation Results
| Method | Model Variant | Date | Accuracy |
|---|---|---|---|
| QwQ 32B | | 2026.04 | 80.34 |
| o4-mini | | 2026.04 | 79.49 |
| Gemini 2.5 | Gemini-2... | 2026.04 | 76.18 |
| Claude 3.7 | Claude-3... | 2026.04 | 73.82 |
| Gemini 2.0 | Gemini-2... | 2026.04 | 73.8 |
| Deepseek R1 | | 2026.04 | 73.78 |
| Claude 3.5 | Claude-3... | 2026.04 | 72.83 |
| Deepseek V3 | | 2026.04 | 72.33 |
| Gemma 3-27B | | 2026.04 | 71.51 |
| GPT-4o | | 2026.04 | 70.24 |
| LLaMA 4 | | 2026.04 | 68.54 |