Human-centric Quality Evaluation on Arena-Hard
[Chart: Arena-Hard Score by date. Top result: 28.8 (Instruct), May 28, 2026.]
Evaluation Results
| Method | Model | Date | Arena-Hard Score |
|---|---|---|---|
| Instruct | Qwen3-4B | 2026.05 | 28.8 |
| REDIPO | Qwen3-4B | 2026.05 | 28.5 |
| DPO | Qwen3-4B | 2026.05 | 28.2 |
| DivPO | Qwen3-4B | 2026.05 | 28.2 |
| Base | Qwen3-4B | 2026.05 | 19.5 |
| Instruct | LLaMA-3.1-8B | 2026.05 | 19.3 |
| REDIPO | LLaMA-3.1-8B | 2026.05 | 18.6 |
| DPO | LLaMA-3.1-8B | 2026.05 | 18.3 |
| DivPO | LLaMA-3.1-8B | 2026.05 | 18.1 |
| Base | OLMo-3-7B | 2026.05 | 14.6 |
| Instruct | OLMo-3-7B | 2026.05 | 14 |
| DPO | OLMo-3-7B | 2026.05 | 14 |
| REDIPO | OLMo-3-7B | 2026.05 | 14 |
| Base | LLaMA-3.1-8B | 2026.05 | 13.4 |
| DivPO | OLMo-3-7B | 2026.05 | 12.8 |
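Because the table has a Base row for each backbone model, one way to read it is as the gain each other variant makes over its own Base model. The sketch below is an illustrative assumption, not part of the leaderboard: it transcribes the scores above into a Python dict (the names `SCORES` and `gains_over_base` are hypothetical) and subtracts each backbone's Base score from the other rows for that backbone.

```python
# Arena-Hard scores transcribed from the table above: (method, model) -> score.
# SCORES and gains_over_base are hypothetical names chosen for this sketch.
SCORES = {
    ("Instruct", "Qwen3-4B"): 28.8,
    ("REDIPO", "Qwen3-4B"): 28.5,
    ("DPO", "Qwen3-4B"): 28.2,
    ("DivPO", "Qwen3-4B"): 28.2,
    ("Base", "Qwen3-4B"): 19.5,
    ("Instruct", "LLaMA-3.1-8B"): 19.3,
    ("REDIPO", "LLaMA-3.1-8B"): 18.6,
    ("DPO", "LLaMA-3.1-8B"): 18.3,
    ("DivPO", "LLaMA-3.1-8B"): 18.1,
    ("Base", "LLaMA-3.1-8B"): 13.4,
    ("Base", "OLMo-3-7B"): 14.6,
    ("Instruct", "OLMo-3-7B"): 14.0,
    ("DPO", "OLMo-3-7B"): 14.0,
    ("REDIPO", "OLMo-3-7B"): 14.0,
    ("DivPO", "OLMo-3-7B"): 12.8,
}


def gains_over_base(scores):
    """Return {model: {method: score - base_score}} for every non-Base method."""
    # Collect the Base score for each backbone model.
    base = {model: s for (method, model), s in scores.items() if method == "Base"}
    gains = {}
    for (method, model), s in scores.items():
        if method != "Base":
            # Round to one decimal to match the table's precision.
            gains.setdefault(model, {})[method] = round(s - base[model], 1)
    return gains


if __name__ == "__main__":
    for model, by_method in gains_over_base(SCORES).items():
        # Print largest gain first for each backbone.
        for method, delta in sorted(by_method.items(), key=lambda kv: -kv[1]):
            print(f"{model:13s} {method:9s} {delta:+.1f}")
```

On these numbers, every non-Base variant scores above Base for Qwen3-4B and LLaMA-3.1-8B, while all four OLMo-3-7B variants score below the OLMo-3-7B Base (14.6).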