LLM Evaluation on Arena-Hard v2
[Chart: Arena-Hard v2 score over time, Jan 14 – Jan 28, 2026. Top entry: Qwen3-8B + CE-RM-4B, score 18.2. Series: Score, with Confidence Interval band. Updated 4d ago.]
Evaluation Results

| Method | Configuration | Date | Score | Confidence Interval |
|---|---|---|---|---|
| Qwen3-8B + CE-RM-4B | GRPO group size=8, Rew... | 2026.01 | 18.2 | - |
| Qwen3-8B + CE-RM-4B | GRPO group size=4, Rew... | 2026.01 | 17.6 | - |
| Qwen3-14B | Model Type=Base Policy... | 2026.01 | 17.1 | - |
| Qwen3-8B + CE-RM-4B | GRPO group size=4, Rew... | 2026.01 | 16.3 | - |
| Qwen3-8B + CompassJudger1-32B | GRPO group size=8, Rew... | 2026.01 | 13.6 | - |
| Qwen3-8B + RM w/o unified criteria | GRPO group size=8, Rew... | 2026.01 | 13.5 | - |
| Qwen3-8B + CompassJudger1-32B | GRPO group size=4, Rew... | 2026.01 | 13.4 | - |
| Qwen3-8B + RM w/o unified criteria | GRPO group size=4, Rew... | 2026.01 | 12.9 | - |
| Qwen3-8B | Model Type=Base Policy... | 2026.01 | 9.8 | - |
| STEP3-VL-10B | Number of Parameters=10B | 2026.01 | 0.5857 | - |
| Qwen3-VL Thinking | Number of Parameters=8B | 2026.01 | 0.4734 | - |
| MiMo-VL RL-2508 | Number of Parameters=7B | 2026.01 | 0.2859 | - |
| InternVL 3.5 | Number of Parameters=8B | 2026.01 | 0.1557 | - |
| GLM-4.6V Flash | Number of Parameters=9B | 2026.01 | 0.0926 | - |