
Arena-Hard

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| LLM Alignment Evaluation | Arena-Hard | Win Rate 42.7 | 67 |
| Language Model Alignment Evaluation | Arena-Hard v0.1 | Win Rate (%) 35.2 | 36 |
| General Instruction Following | Arena-Hard | Score 22.1 | 35 |
| Creative Writing | Arena-Hard Creative Writing v2 | Score 90.8 | 25 |
| General Instruction Following | Arena-Hard v2 | Score 85.9 | 23 |
| Instruction Following | Arena Hard v0.1 | Score 37.9 | 16 |
| LLM Evaluation | Arena-Hard v2 | Score 18.2 | 14 |
| Open-domain task | Arena-Hard (test) | Error 12.61 | 12 |
| Open-domain task | Arena-Hard | Error (%) 5.17 | 12 |
| Preference Modeling | Arena-Hard V2 | Win Rate 73.2 | 9 |
| General Language Model Evaluation | Arena-Hard V2.0 | Win Rate 7.03 | 9 |
| LLM Evaluation | Arena-Hard v0.1 | Arena-Hard Score 78.3 | 9 |
| Chat Preference | Arena Hard v2 | Score 79.9 | 8 |
| General Writing | Arena-Hard Creative Writing | Score 93.6 | 6 |
| General Writing | Arena-Hard Prompt | Score 72.6 | 6 |
| General Chat | Arena-Hard Style-Controlled | Win-rate 46.1 | 5 |
| General Chat | Arena-Hard Vanilla | Win Rate 0.492 | 5 |
| Reward Model Evaluation | Arena-Hard RU | Best@8 Score 92.69 | 5 |
| Open-ended text generation | Arena-hard Creative-Writing | Pairwise Win Rate 80.2 | 4 |
| Open-ended text generation | Arena-hard Hard-Prompt | Pairwise Win Rate 58.5 | 4 |
| Creative Writing | Arena Hard | Win Rate 63.5 | 4 |
| Human Preference Evaluation | Arena-Hard v0.1 | Win Rate 56.7 | 3 |
3