
ArenaHard

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Reverse Chain-of-Thought Generation | ArenaHard | Score: 72 | 20 |
| LLM Chat Evaluation | ArenaHard | Accuracy: 49.2 | 17 |
| Open-ended Writing | ArenaHard | Accuracy: 50 | 17 |
| Instruction Following Evaluation | ArenaHard v1 | ArenaHardv1 Score: 38 | 14 |
| Creative Writing | ArenaHard creative writing v2.0 | WR Score: 29 | 13 |
| Instruction Following | ArenaHard Creative Writing 2.0 | Win Rate: 61.9 | 12 |
| Instruction Following | ArenaHard Hard Prompts 2.0 | Win Rate: 32.7 | 12 |
| General Chat | ArenaHard v2.0 | Win Rate: 52 | 12 |
| General Chat | ArenaHard v1.0 | Win Rate: 82.75 | 12 |
| General Reasoning and Creative Writing | ArenaHard v2 | Hard Prompt Score: 15.5 | 8 |
| Alignment | ArenaHard | pass@1: 95.7 | 7 |
| Chatbot Evaluation | ArenaHard v2 | ArenaHard v2 Score: 57.4 | 6 |
| Human Preference Alignment | ArenaHard V2 | Avg@3 Score: 60 | 6 |
| Alignment & Instruction Following | ArenaHard Hard Prompt v2 | Pass@1: 88.2 | 4 |
| Chatbot Evaluation | ArenaHard | Win Rate: 13.88 | 3 |
| Writing and Arena Evaluation | ArenaHard v2 | ArenaHard-v2 Creative Accuracy: 89 | 3 |
| Alignment & Instruction Following | ArenaHard Creative Writing v2 | Pass@1: 78.7 | 3 |
| Alignment & Instruction Following | ArenaHard Avg. v2 | Pass@1: 83.5 | 3 |