Downstream Policy Performance on Arena-Hard v2.0
[Chart: Win Rate by date; best result GRPO + Improved RM, 33.9 (May 29, 2026)]
Evaluation Results
Method               Policy Model   Date     Win Rate
GRPO + Improved RM   Qwen3-4B-...   2026.05  33.9
GRPO + SAVE          Qwen3-4B-...   2026.05  33.5
GRPO + R2M           Qwen3-4B-...   2026.05  32.6
SFT                  Qwen3-4B-...   2026.05  31.9
GRPO                 Qwen3-4B-...   2026.05  30.2
REINFORCE++          Qwen3-4B-...   2026.05  28.6
PRIME                Qwen3-4B-...   2026.05  22.7
GRPO + Improved RM   Qwen2.5-3...   2026.05  2.6
GRPO                 Qwen2.5-3...   2026.05  2.2
GRPO + SAVE          Qwen2.5-3...   2026.05  2.2
REINFORCE++          Qwen2.5-3...   2026.05  2.1
SFT                  Qwen2.5-3...   2026.05  2
GRPO + R2M           Qwen2.5-3...   2026.05  1.9
PRIME                Qwen2.5-3...   2026.05  1.3