LLM Evaluation on Arena-Hard v0.1
[Chart: Arena-Hard Score and Score CI by date. Highlighted point: Qwen3-8B + CE-RM-4B, Arena-Hard Score 78.3, Jan 28, 2026. Updated 4d ago.]
Evaluation Results
| Method | Configuration (truncated on page) | Date | Arena-Hard Score | Score CI |
| --- | --- | --- | --- | --- |
| Qwen3-8B + CE-RM-4B | GRPO group size=4, Rew... | 2026.01 | 78.3 | - |
| Qwen3-8B + CE-RM-4B | GRPO group size=8, Rew... | 2026.01 | 77.6 | - |
| Qwen3-14B | Model Type=Base Policy... | 2026.01 | 77.4 | - |
| Qwen3-8B + CE-RM-4B | GRPO group size=4, Rew... | 2026.01 | 75.7 | - |
| Qwen3-8B + CompassJudger1-32B | GRPO group size=4, Rew... | 2026.01 | 75 | - |
| Qwen3-8B + CompassJudger1-32B | GRPO group size=8, Rew... | 2026.01 | 74.7 | - |
| Qwen3-8B + RM w/o unified criteria | GRPO group size=8, Rew... | 2026.01 | 72.1 | - |
| Qwen3-8B + RM w/o unified criteria | GRPO group size=4, Rew... | 2026.01 | 71 | - |
| Qwen3-8B | Model Type=Base Policy... | 2026.01 | 66.5 | - |