
Arena

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Instruction Following | Arena Hard | Win Rate: 98.11 | 103 |
| LLM-as-a-judge | ARENA | Accuracy: 66.07 | 20 |
| Conversational versatility | Arena-Hard | Win Rate: 61.16 | 20 |
| Image Editing | Arena Analysis March 26, 2026 (test) | Arena ELO: 1,270 | 16 |
| Open-ended Generation | Arena-Hard | Score: 84.6 | 14 |
| Pluralistic Reward Model Learning | ARENA | Accuracy (ARENA): 60.56 | 10 |
| Technical problem-solving | Arena Hard | Win Rate: 52.3 | 10 |
| Alignment | Arena-Hard | Score: 48.1 | 5 |
| Alignment | Arena-Hard | Hard Prompt Gemini Score: 70.4 | 4 |
| Human Preference Evaluation | Arena Creative Writing | Win Rate: 23.4 | 3 |
| Preference Prediction | Arena | Count of Significant Features (S): 7 | 2 |