
ArenaHard

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Reverse Chain-of-Thought Generation | ArenaHard | Score: 72 | 20 |
| Instruction Following Evaluation | ArenaHard v1 | ArenaHardv1 Score: 38 | 14 |
| Creative Writing | ArenaHard creative writing v2.0 | WR Score: 29 | 13 |
| Instruction Following | ArenaHard Creative Writing 2.0 | Win Rate: 61.9 | 12 |
| Instruction Following | ArenaHard Hard Prompts 2.0 | Win Rate: 32.7 | 12 |
| General Chat | ArenaHard v2.0 | Win Rate: 52 | 12 |
| General Chat | ArenaHard v1.0 | Win Rate: 82.75 | 12 |
| General Reasoning and Creative Writing | ArenaHard v2 | Hard Prompt Score: 15.5 | 8 |
| Alignment | ArenaHard | pass@1: 95.7 | 7 |
| Human Preference Alignment | ArenaHard V2 | Avg@3 Score: 60 | 6 |
| Alignment & Instruction Following | ArenaHard Hard Prompt v2 | Pass@1: 88.2 | 4 |
| Chatbot Evaluation | ArenaHard v2 | Hard Prompt Accuracy: 14 | 4 |
| Alignment & Instruction Following | ArenaHard Creative Writing v2 | Pass@1: 78.7 | 3 |
| Alignment & Instruction Following | ArenaHard Avg. v2 | Pass@1: 83.5 | 3 |