Arena

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Instruction Following | Arena Hard | Win Rate: 98.11 | 263 |
| LLM-as-a-judge | ARENA | Accuracy: 66.07 | 20 |
| Conversational versatility | Arena-Hard | Win Rate: 61.16 | 20 |
| Textual understanding | Arena-Hard | Win Rate: 67.1 | 17 |
| Image Editing | Arena Analysis March 26, 2026 (test) | Arena ELO: 1,270 | 16 |
| Human-centric Quality Evaluation | Arena-Hard | Arena-Hard Score: 28.8 | 15 |
| Open-ended Generation | Arena-Hard | Score: 84.6 | 14 |
| Pluralistic Reward Model Learning | ARENA | Accuracy (ARENA): 60.56 | 10 |
| Technical problem-solving | Arena Hard | Win Rate: 52.3 | 10 |
| Personalized Dialogue | Arena-Hard | Arena Win Rate: 60 | 7 |
| Preference-based Generation | Arena CW | Score: 36.5 | 6 |
| Preference-based Generation | Arena HP | Score: 24.8 | 6 |
| Alignment | Arena-Hard | Score: 48.1 | 5 |
| Reward Modeling | Arena100K (test) | Table MSE: 0.2311 | 4 |
| Alignment | Arena-Hard | Hard Prompt Gemini Score: 70.4 | 4 |
| Human Preference Evaluation | Arena (Phase 2) | Total Battles: 200 | 3 |
| Human Preference Evaluation | Arena Creative Writing | Win Rate: 23.4 | 3 |
| Preference Prediction | Arena | Count of Significant Features (S): 7 | 2 |

Showing 18 of 18 rows