Arena-Hard

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| LLM Alignment Evaluation | Arena-Hard | Win Rate 42.7 | 73 |
| General Instruction Following | Arena-Hard | Score 22.1 | 46 |
| General Instruction Following | Arena-Hard v2 | Score 85.9 | 37 |
| Language Model Alignment Evaluation | Arena-Hard v0.1 | Win Rate (%) 35.2 | 36 |
| LLM Alignment Evaluation | Arena-Hard v0.1 | Win Rate 50 | 31 |
| Creative Writing | Arena-Hard Creative Writing v2 | Score 90.8 | 25 |
| Instruction Following | Arena-Hard Vanilla | Instruction Following Score 57.5 | 19 |
| Creative Writing | Arena Hard | Win Rate 63.5 | 18 |
| Instruction Following | Arena-Hard Style-Con | Score 57.7 | 17 |
| General Chat Evaluation | Arena-Hard | Win Rate 84 | 16 |
| Instruction Following | Arena Hard v0.1 | Score 37.9 | 16 |
| Downstream Policy Performance | Arena-Hard v2.0 | Win Rate 33.9 | 14 |
| LLM Evaluation | Arena-Hard v2 | Score 18.2 | 14 |
| Complex reasoning | Arena-Hard 2.0 (test) | Overall Accuracy 52.9 | 12 |
| Open-domain task | Arena-Hard (test) | Error 12.61 | 12 |
| Open-domain task | Arena-Hard | Error (%) 5.17 | 12 |
| Conversational Skill Evaluation | Arena-Hard | Win Rate (%) 32.6 | 11 |
| Chat Preference | Arena Hard v2 | Score 79.9 | 10 |
| Chat Quality Evaluation | Arena-Hard vs gpt-4-0314 (test) | Win Rate 57.6 | 9 |
| Preference Modeling | Arena-Hard V2 | Win Rate 73.2 | 9 |
| General Language Model Evaluation | Arena-Hard V2.0 | Win Rate 7.03 | 9 |
| LLM Evaluation | Arena-Hard v0.1 | Arena-Hard Score 78.3 | 9 |
| Open-ended Generation | Arena-Hard v2.0 | Score 47.8 | 8 |
| Instruction following | Arena-Hard v2 (test) | AH2 Score 1.3 | 8 |
| LLM Chat Evaluation | Arena-Hard v0.1 (test) | Win Rate 40.1 | 6 |
Showing 25 of 39 rows